<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width="750" align="center"></a>
<br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Basic Statistics and Data Types - Sampling 
 

## Lesson Objectives

After completing this lesson, you should be able to:

- Perform standard sampling on any RDD
- Split any RDD randomly into subsets
- Perform stratified sampling on RDDs of key-value pairs


## Sampling 

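- Can be performed on any RDD
- Returns a sampled subset of the RDD
- Supports sampling with or without replacement
- The meaning of the `fraction` parameter depends on the mode:
  - without replacement: the expected size of the sample as a fraction of the RDD's size
  - with replacement: the expected number of times each element is chosen
- Because `fraction` sets an expectation, the actual sample size varies from run to run
- Can be used in bootstrapping procedures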
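
To make the meaning of `fraction` concrete, here is a minimal plain-Scala sketch (no Spark required) of sampling without replacement. It assumes each element is kept independently with probability `fraction`. The helper `bernoulliSample` is hypothetical and not part of the Spark API; it only illustrates why the sample size is an expectation rather than a guarantee.

```scala
import scala.util.Random

// Hypothetical helper (not a Spark API): keeps each element independently
// with probability `fraction`, which mimics sampling without replacement.
def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

val population = 1 to 100000

// Same fraction, different seeds: the sizes vary around 0.5 * 100000.
val sampleSizes = (1L to 5L).map(seed => bernoulliSample(population, 0.5, seed).size)
println(sampleSizes)
```

On a very small data set the relative variation is much larger, so a sample taken with `fraction = 0.5` from three elements can contain anywhere from zero to three of them.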

In [1]:
// A Simple Sampling 

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val elements: RDD[Vector] = sc.parallelize(Array(
    Vectors.dense(4.0,7.0,13.0),
    Vectors.dense(-2.0,8.0,4.0),
    Vectors.dense(3.0,-11.0,19.0)))

elements.sample(withReplacement=false, fraction=0.5, seed=10L).collect()


Initializing Scala interpreter ...

Spark Web UI available at http://host.docker.internal:4043
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1669534598054)
SparkSession available as 'spark'


import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
elements: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[0] at parallelize at <console>:29
res0: Array[org.apache.spark.mllib.linalg.Vector] = Array([3.0,-11.0,19.0])


In [2]:
elements.sample(withReplacement=false, fraction=0.5, seed=7L).collect()

res1: Array[org.apache.spark.mllib.linalg.Vector] = Array([4.0,7.0,13.0], [-2.0,8.0,4.0], [3.0,-11.0,19.0])


In [3]:
elements.sample(withReplacement=false, fraction=0.5, seed=64L).collect()

res2: Array[org.apache.spark.mllib.linalg.Vector] = Array([4.0,7.0,13.0], [3.0,-11.0,19.0])
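
Sampling with replacement is the basis of bootstrapping. The following is a minimal sketch, assuming the notebook's SparkContext `sc` is available; the `values` RDD and the `bootstrapMean` helper are illustrative names that are not part of the lesson's original code. With `withReplacement = true` and `fraction = 1.0`, each element is drawn once on average, so each resample has roughly the size of the original data.

```scala
import org.apache.spark.rdd.RDD

// Illustrative data; assumes the SparkContext `sc` provided by the notebook.
val values: RDD[Double] = sc.parallelize(
  Seq(4.0, 7.0, 13.0, -2.0, 8.0, 4.0, 3.0, -11.0, 19.0))

// Hypothetical helper: draws one bootstrap resample and returns its mean.
def bootstrapMean(seed: Long): Double =
  values.sample(withReplacement = true, fraction = 1.0, seed = seed).mean()

// Repeat with different seeds to see how much the mean varies across resamples.
val bootstrapMeans = (1L to 20L).map(bootstrapMean)
println(s"min = ${bootstrapMeans.min}, max = ${bootstrapMeans.max}")
```

The spread of `bootstrapMeans` gives a rough picture of the uncertainty in the mean of `values`; a real analysis would normally use many more resamples.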


## Random Split

-	Can be performed on any RDD
-	Returns an array of RDDs
- Weights for the split will be normalized if they do not add up to 1
-	Useful for splitting a data set into training, test and validation sets

In [4]:
val data = sc.parallelize(1 to 1000000)
val splits = data.randomSplit(Array(0.6, 0.2, 0.2), seed = 13L)

val training = splits(0)
val test = splits(1)
val validation = splits(2)

splits.map(_.count())



data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:26
splits: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[5] at randomSplit at <console>:27, MapPartitionsRDD[6] at randomSplit at <console>:27, MapPartitionsRDD[7] at randomSplit at <console>:27)
training: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at randomSplit at <console>:27
test: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at randomSplit at <console>:27
validation: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at randomSplit at <console>:27
res3: Array[Long] = Array(600491, 199608, 199901)
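
Two properties of `randomSplit` are worth spelling out: the weights only matter relative to each other, and every element ends up in exactly one subset (the three counts above add up to 1,000,000). A plain-Scala sketch of the idea follows. It is an illustration under simplifying assumptions, not Spark's actual implementation, and `splitByWeights` is a hypothetical helper.

```scala
import scala.util.Random

// Hypothetical helper: normalizes the weights, turns them into cumulative
// boundaries on [0, 1], and assigns each element to the interval that
// contains its uniform random draw.
def splitByWeights[T](data: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
  require(weights.nonEmpty && weights.forall(_ >= 0.0) && weights.sum > 0.0,
    "weights must be non-negative with a positive sum")
  val total = weights.sum
  val bounds = weights.map(_ / total).scanLeft(0.0)(_ + _)
  val rng = new Random(seed)
  val assigned = data.map { element =>
    val u = rng.nextDouble()
    val k = bounds.indexWhere(_ > u)
    // Guard against floating-point rounding in the last boundary.
    val splitIndex = if (k == -1) weights.length - 1 else k - 1
    (splitIndex, element)
  }
  weights.indices.map(i => assigned.filter(_._1 == i).map(_._2))
}

// Weights 3:1:1 are normalized to 0.6, 0.2, 0.2.
val parts = splitByWeights(1 to 100000, Seq(3.0, 1.0, 1.0), seed = 13L)
println(parts.map(_.size))
```

In Spark itself, `data.randomSplit(Array(3.0, 1.0, 1.0), seed = 13L)` should give the same expected 60/20/20 proportions as `Array(0.6, 0.2, 0.2)`, since the weights are normalized before use.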


## Stratified Sampling 

- Can be performed on RDDs of key-value pairs
- Think of the keys as labels and the values as a specific attribute
- Two methods are supported, both defined in `PairRDDFunctions`:
  - `sampleByKey` requires only one pass over the data and provides an expected sample size
  - `sampleByKeyExact` provides the exact sample size with 99.99% confidence but requires significantly more resources

In [5]:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.IndexedRow

val rows: RDD[IndexedRow] = sc.parallelize(Array(
    IndexedRow(0, Vectors.dense(1.0,2.0)),
    IndexedRow(1, Vectors.dense(4.0,5.0)),
    IndexedRow(1, Vectors.dense(7.0,8.0))))

val fractions: Map[Long, Double] = Map(0L -> 1.0, 1L -> 0.5)

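// Turn each IndexedRow into an (index, vector) pair so that the index acts as the key
val approxSample = rows.map{
    case IndexedRow(index, vec) => (index, vec)
}.sampleByKey(withReplacement = false, fractions = fractions, seed = 9L)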

approxSample.collect()

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.IndexedRow
rows: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.distributed.IndexedRow] = ParallelCollectionRDD[8] at parallelize at <console>:29
fractions: Map[Long,Double] = Map(0 -> 1.0, 1 -> 0.5)
approxSample: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[10] at sampleByKey at <console>:38
res4: Array[(Long, org.apache.spark.mllib.linalg.Vector)] = Array((0,[1.0,2.0]), (1,[4.0,5.0]), (1,[7.0,8.0]))
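
In the result above, key `1` was given a fraction of 0.5, yet both of its rows were returned: `sampleByKey` only controls the expected sample size per key. The sketch below compares it with `sampleByKeyExact` on a larger data set, assuming the notebook's SparkContext `sc`; the `labeled` RDD and its `"rare"` and `"common"` keys are made up for illustration.

```scala
import org.apache.spark.rdd.RDD

// Illustrative data (not from the lesson): 250 "rare" and 750 "common" records.
val labeled: RDD[(String, Int)] = sc.parallelize(
  (1 to 1000).map(i => (if (i % 4 == 0) "rare" else "common", i)))

// Sampling fraction per key (stratum).
val keyFractions: Map[String, Double] = Map("rare" -> 0.5, "common" -> 0.1)

// One pass over the data; per-key sample sizes are only expected values.
val approxCounts = labeled
  .sampleByKey(withReplacement = false, fractions = keyFractions, seed = 9L)
  .countByKey()

// More expensive; aims for the exact per-key sample size.
val exactCounts = labeled
  .sampleByKeyExact(withReplacement = false, fractions = keyFractions, seed = 9L)
  .countByKey()

println(s"sampleByKey:      $approxCounts")
println(s"sampleByKeyExact: $exactCounts")
```

`sampleByKeyExact` is expected to return about 125 `"rare"` and 75 `"common"` records (fraction times group size), while the `sampleByKey` counts will typically land near those values without necessarily matching them.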


## Lesson Summary

Having completed this lesson, you should now be able to:

- Perform standard sampling on any RDD
- Split any RDD randomly into subsets
- Perform stratified sampling on RDDs of key-value pairs

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.