<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width = 750, align = "center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Basic Statistics and Data Types - Summary Statistics, Correlations, and Random Data 


## Lesson Objectives 

After completing this lesson, you should be able to:

- Compute column summary statistics
- Compute pairwise correlations between series/columns
- Generate random data from different distributions


## Summary Statistics 
- Column summary statistics for an instance of `RDD[Vector]` are available through the `colStats()` function in `Statistics`
- It returns an instance of `MultivariateStatisticalSummary`, which contains column-wise results for:
    - `min`, `max`
    - `mean`, `variance`
    - `numNonzeros`
    - `normL1`, `normL2`
- `count` returns the number of rows (vectors) in the RDD

In [1]:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.stat.Statistics 
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

val observations: RDD[Vector] = sc.parallelize(Array(
    Vectors.dense(1.0, 2.0), 
    Vectors.dense(4.0, 5.0), 
    Vectors.dense(7.0, 8.0)))
    
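// Compute column-wise summary statistics for the observations
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

// Passing two arguments to println prints them as a tuple, e.g. (mean,[4.0,5.0])
println("mean", summary.mean)
println("variance", summary.variance)
println("Nonzeros", summary.numNonzeros)
println("L1 norm", summary.normL1)
println("L2 norm", summary.normL2)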



(mean,[4.0,5.0])
(variance,[9.0,9.0])
(Nonzeros,[3.0,3.0])
(L1 norm,[12.0,15.0])
(L2 norm,[8.12403840463596,9.643650760992955])


observations = ParallelCollectionRDD[0] at parallelize at <console>:33
summary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2b9d5c4d
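
The figures printed above can be checked by hand. The following is a minimal sketch in plain Scala (no Spark involved) that recomputes the mean, variance, and norms for the same three observations. The helper names (`columnMean`, `columnVariance`, `columnNormL1`, `columnNormL2`) are made up for this illustration and are not part of MLlib. It assumes the variance is the sample variance with an `n - 1` denominator, which is what reproduces the `9.0` in the output.

```scala
// Plain Scala sketch: recompute the column statistics without Spark.
// All helper names below are hypothetical, chosen for this illustration.
val observedRows: Seq[Seq[Double]] = Seq(
  Seq(1.0, 2.0),
  Seq(4.0, 5.0),
  Seq(7.0, 8.0))

// Transpose so that each inner sequence holds one column.
val observedColumns: Seq[Seq[Double]] = observedRows.transpose

def columnMean(xs: Seq[Double]): Double = xs.sum / xs.size

// Sample variance (assumed n - 1 denominator).
def columnVariance(xs: Seq[Double]): Double = {
  val m = columnMean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

// L1 norm: sum of absolute values. L2 norm: square root of the sum of squares.
def columnNormL1(xs: Seq[Double]): Double = xs.map(x => math.abs(x)).sum
def columnNormL2(xs: Seq[Double]): Double = math.sqrt(xs.map(x => x * x).sum)

val handMeans = observedColumns.map(columnMean)
val handVariances = observedColumns.map(columnVariance)
val handNormsL1 = observedColumns.map(columnNormL1)
val handNormsL2 = observedColumns.map(columnNormL2)

println(s"mean     = $handMeans")
println(s"variance = $handVariances")
println(s"L1 norm  = $handNormsL1")
println(s"L2 norm  = $handNormsL2")
```

Working per column like this is only practical for small local data; `colStats()` remains the tool for an `RDD[Vector]`.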



## Correlations

- Pairwise correlations among series are available through the `corr()` function in `Statistics`
- Correlation methods supported:
    - Pearson (default)
    - Spearman (used for rank variables)
- Inputs supported:
    - two `RDD[Double]`s, returning a single `Double` value
    - an `RDD[Vector]` (one series per column), returning a correlation `Matrix`


## Pearson Correlation Between Two Series

In [2]:

val x: RDD[Double] = sc.parallelize(Array(2.0, 9.0, -7.0))
val y: RDD[Double]= sc.parallelize(Array(1.0, 3.0, 5.0))
val correlation: Double = Statistics.corr(x, y, "pearson")

correlation 

x = ParallelCollectionRDD[2] at parallelize at <console>:32
y = ParallelCollectionRDD[3] at parallelize at <console>:33
correlation = -0.5610408535732833
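
To make the result above less of a black box, here is a minimal plain-Scala sketch (no Spark) of the textbook Pearson formula: the sum of the products of the deviations from the mean, divided by the product of the square roots of the summed squared deviations. The function name `pearsonByHand` is hypothetical; it is not an MLlib call, and this is not a claim about how `Statistics.corr` is implemented internally.

```scala
// Plain Scala sketch of the Pearson correlation coefficient.
// `pearsonByHand` is a hypothetical helper, not part of Spark.
def pearsonByHand(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.size > 1, "series must have equal length > 1")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Sum of the products of the deviations from each mean.
  val coDeviation = xs.zip(ys).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
  // Square roots of the summed squared deviations of each series.
  val spreadX = math.sqrt(xs.map(a => (a - meanX) * (a - meanX)).sum)
  val spreadY = math.sqrt(ys.map(b => (b - meanY) * (b - meanY)).sum)
  coDeviation / (spreadX * spreadY)
}

// Same values as the two series used in the cell above.
val seriesX = Seq(2.0, 9.0, -7.0)
val seriesY = Seq(1.0, 3.0, 5.0)
val handCorrelation = pearsonByHand(seriesX, seriesY)

println(handCorrelation) // approximately -0.561
```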



In [3]:

// Pearson correlation matrix: each column of the vectors below is treated as one series

val  data: RDD[Vector] = sc.parallelize(Array(
    Vectors.dense(2.0, 9.0, -7.0),
    Vectors.dense(1.0, -3.0, 5.0), 
    Vectors.dense(4.0, 0.0, -5.0)))

val correlMatrix: Matrix = Statistics.corr(data, "pearson")

correlMatrix

data = ParallelCollectionRDD[8] at parallelize at <console>:34
correlMatrix =
1.0                  0.05241424183609593  -0.6449020216370243
0.05241424183609593  1.0                  -0.7970167702187486
-0.6449020216370243  -0.7970167702187486  1.0


In [4]:

// Pearson vs Spearman Correlation among Series 

val ranks: RDD[Vector] = sc.parallelize(Array(Vectors.dense(1.0,2.0,3.0),
    Vectors.dense(5.0,6.0,4.0), Vectors.dense(7.0,8.0,9.0)))

val corrPearsonMatrix: Matrix = Statistics.corr(ranks, "pearson")
println("Pearson")
println(corrPearsonMatrix)

val corrSpearmanMatrix: Matrix = Statistics.corr(ranks, "spearman")
println("Spearman")
println(corrSpearmanMatrix)

Pearson
1.0                 1.0000000000000002  0.8485552916276634  
1.0000000000000002  1.0                 0.8485552916276634  
0.8485552916276634  0.8485552916276634  1.0                 
Spearman
1.0  1.0  1.0  
1.0  1.0  1.0  
1.0  1.0  1.0  


ranks = ParallelCollectionRDD[11] at parallelize at <console>:34
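
The gap between the Pearson and Spearman matrices follows from how Spearman correlation is defined: it is the Pearson correlation computed on the ranks of the values rather than on the values themselves. Below is a minimal plain-Scala sketch (no Spark) for the first and third columns of `ranks`. The names `pearsonOf`, `rankOf`, and `spearmanOf` are hypothetical helpers, and the ranking step assumes there are no tied values.

```scala
// Plain Scala sketch: Spearman correlation as Pearson correlation on ranks.
// All helper names are hypothetical; ties are NOT handled (assumption).
def pearsonOf(xs: Seq[Double], ys: Seq[Double]): Double = {
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val coDeviation = xs.zip(ys).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
  val spreadX = math.sqrt(xs.map(a => (a - meanX) * (a - meanX)).sum)
  val spreadY = math.sqrt(ys.map(b => (b - meanY) * (b - meanY)).sum)
  coDeviation / (spreadX * spreadY)
}

// Replace each value by its 1-based position in the sorted series.
def rankOf(xs: Seq[Double]): Seq[Double] = {
  val sorted = xs.sorted
  xs.map(x => sorted.indexOf(x) + 1.0)
}

def spearmanOf(xs: Seq[Double], ys: Seq[Double]): Double =
  pearsonOf(rankOf(xs), rankOf(ys))

// First and third columns of the `ranks` vectors from the cell above.
val firstColumn = Seq(1.0, 5.0, 7.0)
val thirdColumn = Seq(3.0, 4.0, 9.0)

val pearsonFirstThird = pearsonOf(firstColumn, thirdColumn)
val spearmanFirstThird = spearmanOf(firstColumn, thirdColumn)

println(s"Pearson:  $pearsonFirstThird")  // approximately 0.8486
println(s"Spearman: $spearmanFirstThird") // approximately 1.0
```

Both columns increase together from row to row, so their ranks agree perfectly even though the values are not linearly related; that is why the Spearman entries are 1.0 while the Pearson entry is lower.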


## Random Data Generation

- `RandomRDDs` generates either random double RDDs or vector RDDs
- Supported distributions:
    - uniform, normal, lognormal, Poisson, exponential, and gamma
- Useful for randomized algorithms, prototyping, and performance testing

In [5]:
// Simple Example 
import org.apache.spark.mllib.random.RandomRDDs._

val million = poissonRDD(sc, mean=1.0, size=1000000L, numPartitions=10)

println( "mean", million.mean )
println( "variance",million.variance)

(mean,0.9998099999999975)
(variance,0.9995679639000008)


million = RandomRDD[23] at RDD at RandomRDD.scala:42



In [6]:
// Simple Vector Example

import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val data = normalVectorRDD(sc, numRows=10000L, numCols=3, numPartitions=10)
val stats: MultivariateStatisticalSummary = Statistics.colStats(data)


println( "mean", stats.mean)
println( "variance",stats.variance)

(mean,[-0.016859128507122633,0.01240208611447731,9.169106226799562E-4])
(variance,[1.0101758581424805,0.9871205692218137,0.998462162833049])


data = RandomVectorRDD[2] at RDD at RandomRDD.scala:64
stats = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2b0717dc
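
The means near 0 and variances near 1 above are consistent with `normalVectorRDD` drawing from the standard normal distribution. To obtain samples with another mean and standard deviation, each standard normal value `z` can be mapped to `mu + sigma * z`. The sketch below shows that transformation in plain Scala, with `scala.util.Random` standing in for Spark (on an RDD it would be a `map` with the same function). The seed, the sample size, and the values `mu = 1.0` and `sigma = 2.0` are arbitrary choices for illustration, and the helper names are hypothetical.

```scala
import scala.util.Random

// Plain Scala sketch: shift and scale standard normal samples.
// Seed and sample size are arbitrary illustration values (assumptions).
val rng = new Random(42L)
val standardSamples: Vector[Double] = Vector.fill(100000)(rng.nextGaussian())

val mu = 1.0    // target mean (illustrative)
val sigma = 2.0 // target standard deviation (illustrative)

// Map each standard normal value z to mu + sigma * z.
val shiftedSamples: Vector[Double] = standardSamples.map(z => mu + sigma * z)

def sampleMean(xs: Seq[Double]): Double = xs.sum / xs.size

def sampleVariance(xs: Seq[Double]): Double = {
  val m = sampleMean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

// The mean should land near mu and the variance near sigma * sigma.
println(s"mean     = ${sampleMean(shiftedSamples)}")
println(s"variance = ${sampleVariance(shiftedSamples)}")
```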



## Available Distributions 
	
-	`exponentialRDD` 
-	`gammaRDD`
-	`logNormalRDD`
- `normalRDD`
-	`poissonRDD`
-	`uniformRDD`
-	`exponentialVectorRDD`
-	`gammaVectorRDD`
-	`logNormalVectorRDD`
-	`normalVectorRDD`
-	`poissonVectorRDD`
-	`uniformVectorRDD`

## Lesson Summary 

Having completed this lesson, you should now be able to:

- Compute column summary statistics
- Compute pairwise correlations between series/columns
- Generate random data from different distributions

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.