<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width="750" align="center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 2: Preparing Data

## Statistics, Random Data and Sampling on DataFrames

### Lesson Objectives 

After completing this lesson, you should be able to: 

- Compute column summary statistics
- Compute pairwise statistics between series/columns
- Perform standard sampling on any `DataFrame`
- Split any `DataFrame` randomly into subsets
- Perform stratified sampling on `DataFrames`
- Generate random data from uniform and normal distributions


## Summary Statistics for DataFrames 

- Column summary statistics for DataFrames are available through DataFrame's `describe()` method
- It returns another `DataFrame`, which contains column-wise results for:
  - `count`
  - `mean`, `stddev`
  - `min`, `max`
- Column summary statistics can also be computed through DataFrame's `groupBy()` and `agg()` methods; older Spark releases did not support `stddev` there, but Spark 1.6 and later provide a `stddev` function in `org.apache.spark.sql.functions`
- This approach also returns another `DataFrame` with the results

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

spark = org.apache.spark.sql.SparkSession@613721a7


org.apache.spark.sql.SparkSession@613721a7

In [2]:
case class Record(desc: String, value1: Int, value2: Double)

val records = Array(Record("first",1,3.7), Record("second",-2,2.1), Record("third",6,0.7))

// The SparkContext is reachable from the session, so no predefined `sc` is needed
val recRDD = spark.sparkContext.parallelize(records)
val recDF = spark.createDataFrame(recRDD)
recDF.show()

val recStats = recDF.describe()
recStats.show()

+------+------+------+
|  desc|value1|value2|
+------+------+------+
| first|     1|   3.7|
|second|    -2|   2.1|
| third|     6|   0.7|
+------+------+------+

+-------+-----+------------------+------------------+
|summary| desc|            value1|            value2|
+-------+-----+------------------+------------------+
|  count|    3|                 3|                 3|
|   mean| null|1.6666666666666667| 2.166666666666667|
| stddev| null| 4.041451884327381|1.5011106998930273|
|    min|first|                -2|               0.7|
|    max|third|                 6|               3.7|
+-------+-----+------------------+------------------+



defined class Record
records = Array(Record(first,1,3.7), Record(second,-2,2.1), Record(third,6,0.7))
recRDD = ParallelCollectionRDD[0] at parallelize at <console>:21
recDF = [desc: string, value1: int ... 1 more field]
recStats = [summary: string, desc: string ... 2 more fields]


[summary: string, desc: string ... 2 more fields]
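
The `stddev` row reported by `describe()` is consistent with the sample standard deviation, which divides by n - 1 rather than n. As a minimal plain-Scala check, assuming the three `value1` entries from the records above, the helper `sampleStddev` (a hypothetical name, not part of Spark) should reproduce a value of about 4.0415:

```scala
// Hypothetical helper (not part of Spark): sample standard deviation,
// dividing the sum of squared deviations by n - 1.
def sampleStddev(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  val sumSq = xs.map(x => math.pow(x - mean, 2)).sum
  math.sqrt(sumSq / (xs.size - 1))
}

// value1 column of the example records
val value1 = Seq(1.0, -2.0, 6.0)
val sd = sampleStddev(value1)
println(sd)
```

Dividing by `xs.size` instead would give the population standard deviation, which is smaller and would not match the `describe()` output.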

In [3]:
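// Fetching Results from DataFrame

// Row of recStats that holds the standard deviations
recStats.filter($"summary" === "stddev").first()

// Drop the "summary" and "desc" columns and convert the remaining strings to Double
val ar1 = recStats.filter($"summary" === "stddev").first().toSeq.drop(2).map(_.toString.toDouble).toArray
// println on an Array prints its JVM reference (such as [D@...), not its elements;
// ar1.mkString(", ") would print the values
println(ar1)

// All summary values of the value1 column, in the order count, mean, stddev, min, max
val ar2 = recStats.select("value1").map(s => s(0).toString.toDouble).collect()
println(ar2)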

[D@70d002c3
[D@4b021f82


ar1 = Array(4.041451884327381, 1.5011106998930273)
ar2 = Array(3.0, 1.6666666666666667, 4.041451884327381, -2.0, 6.0)


Array(3.0, 1.6666666666666667, 4.041451884327381, -2.0, 6.0)

In [None]:
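// A Map holds one entry per key, so only max(value1) is computed here
recDF.groupBy().agg(Map("value1" -> "min", "value1" -> "max"))

// With distinct keys, both min(value1) and min(value2) are computed
recDF.groupBy().agg(Map("value1" -> "min", "value2" -> "min"))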
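
The first call in the cell above cannot return both the minimum and the maximum of `value1`, because a Scala `Map` keeps a single value per key. A small plain-Scala sketch (the name `aggSpec` is only illustrative) makes the collapse visible without needing Spark:

```scala
// A Scala Map keeps one value per key, so the later pair replaces the earlier one.
val aggSpec = Map("value1" -> "min", "value1" -> "max")

println(aggSpec.size)       // number of entries that survive
println(aggSpec("value1"))  // the aggregation that would actually be requested
```

This is one reason to prefer the column-function form, such as `agg(min("value1"), max("value1"))`, when several aggregates of the same column are needed.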

In [4]:
import org.apache.spark.sql.functions._

val recStatsGroup = recDF.groupBy().agg(min("value1"), min("value2"))

recStatsGroup.columns
 
recStatsGroup.first().toSeq.toArray.map(_.toString.toDouble)


recStatsGroup = [min(value1): int, min(value2): double]


Array(-2.0, 0.7)

## More Statistics on DataFrames 

- More statistics are available through the `stat` method of a DataFrame
- It returns a `DataFrameStatFunctions` object, which has the following methods:
  - `corr()` - computes Pearson correlation between two columns
  - `cov()` - computes sample covariance between two columns
  - `crosstab()` - computes a pair-wise frequency table of the given columns
  - `freqItems()` - finds frequent items for columns, possibly with false positives

In [5]:
val recDFStat = recDF.stat

println("correlation value1 and value2",recDFStat.corr("value1", "value2"))
println("covariance value1 and value2",recDFStat.cov("value1", "value2"))
recDFStat.freqItems(Seq("value1"), 0.3).show()

(correlation value1 and value2,-0.5879120879120879)
(covariance value1 and value2,-3.5666666666666673)
+----------------+
|value1_freqItems|
+----------------+
|      [-2, 1, 6]|
+----------------+



recDFStat = org.apache.spark.sql.DataFrameStatFunctions@50753c30


org.apache.spark.sql.DataFrameStatFunctions@50753c30
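
To see where the two numbers come from, the sample covariance and Pearson correlation can be recomputed by hand. The following is a plain-Scala sketch, assuming the `value1` and `value2` entries of the three example records; `meanOf`, `sampleCov` and `pearson` are hypothetical helper names, not Spark functions:

```scala
val xs = Seq(1.0, -2.0, 6.0)   // value1
val ys = Seq(3.7, 2.1, 0.7)    // value2

def meanOf(v: Seq[Double]): Double = v.sum / v.size

// Sample covariance: sum of products of deviations, divided by n - 1.
def sampleCov(a: Seq[Double], b: Seq[Double]): Double = {
  val (ma, mb) = (meanOf(a), meanOf(b))
  a.zip(b).map { case (x, y) => (x - ma) * (y - mb) }.sum / (a.size - 1)
}

// Pearson correlation: covariance scaled by both standard deviations.
def pearson(a: Seq[Double], b: Seq[Double]): Double =
  sampleCov(a, b) / math.sqrt(sampleCov(a, a) * sampleCov(b, b))

val covXY = sampleCov(xs, ys)
val corrXY = pearson(xs, ys)
println(s"cov = $covXY, corr = $corrXY")
```

Both values should agree with the `cov()` and `corr()` results up to floating-point rounding.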

## Sampling on DataFrames 

- Can be performed on any `DataFrame`
- Returns a sampled subset of a `DataFrame`
- Sampling with or without replacement
- Fraction: expected fraction of rows to generate; the number of rows returned is not guaranteed to match it exactly
- Can be used in bootstrapping procedures

In [6]:

val df = spark.createDataFrame(Seq((1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 20), (3, 30))).toDF("key", "value")
df.show()
val dfSampled = df.sample(withReplacement=false, fraction=0.3, seed=11L)
dfSampled.show()

+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  1|   20|
|  2|   10|
|  2|   20|
|  2|   30|
|  3|   20|
|  3|   30|
+---+-----+

+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  1|   20|
|  3|   20|
|  3|   30|
+---+-----+



df = [key: int, value: int]
dfSampled = [key: int, value: int]


[key: int, value: int]
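
The sample above contains 4 of 7 rows even though the fraction is 0.3, because the fraction is an expectation rather than an exact quota. A minimal plain-Scala sketch of sampling without replacement, assuming each row is kept independently with probability `fraction`, illustrates the idea. `bernoulliSample` is a hypothetical helper, and because it uses `scala.util.Random` its selections will not match Spark's for the same seed:

```scala
import scala.util.Random

// Rows of the example DataFrame as plain (key, value) pairs.
val rows = Seq((1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 20), (3, 30))

// Hypothetical helper: keep each row independently with probability `fraction`.
def bernoulliSample[A](data: Seq[A], fraction: Double, seed: Long): Seq[A] = {
  val rng = new Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)
}

val sampled = bernoulliSample(rows, fraction = 0.3, seed = 11L)
println(s"kept ${sampled.size} of ${rows.size} rows")
```

With only seven rows, the kept count can easily differ from 0.3 × 7 ≈ 2; the proportion approaches the requested fraction as the data grows.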

## Random Split on DataFrames

- Can be performed on any `DataFrame`
- Returns an array of DataFrames
- Weights for the split will be normalized if they do not add up to 1
- Useful for splitting a data set into training, test and validation sets

In [7]:
val dfSplit = df.randomSplit(weights=Array(0.3, 0.7), seed=11L) 

dfSplit(0).show()
dfSplit(1).show()

+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  1|   20|
|  3|   20|
|  3|   30|
+---+-----+

+---+-----+
|key|value|
+---+-----+
|  2|   10|
|  2|   20|
|  2|   30|
+---+-----+



dfSplit = Array([key: int, value: int], [key: int, value: int])


Array([key: int, value: int], [key: int, value: int])
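
Normalizing the weights simply means dividing each one by their sum. The plain-Scala sketch below uses hypothetical weights for a training, validation and test split, and also derives cumulative boundaries on [0, 1] as one way to picture how rows could be assigned to splits. It illustrates the idea only and is not Spark's actual implementation:

```scala
// Hypothetical weights that do not sum to 1 (training, validation, test).
val weights = Array(6.0, 2.0, 2.0)

// Normalize so that the weights sum to 1.
val normalized = weights.map(_ / weights.sum)

// Cumulative boundaries: split i covers the interval [bounds(i), bounds(i + 1)).
val bounds = normalized.scanLeft(0.0)(_ + _)

println(normalized.mkString(", "))
println(bounds.mkString(", "))
```

Under this reading, passing `Array(6.0, 2.0, 2.0)` to `randomSplit` should request the same proportions as `Array(0.6, 0.2, 0.2)`.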

## Stratified Sampling on DataFrames 

- Can be performed on any `DataFrame`
- Any column may serve as the key
- Without replacement
- Fraction: specified per key
- Available as the `sampleBy` function in `DataFrameStatFunctions`

In [8]:
df.stat.
    sampleBy(col="key",
    fractions=Map(1 -> 0.7, 2 -> 0.7, 3 -> 0.7),
    seed=11L).show()

+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  2|   10|
|  2|   20|
|  3|   30|
+---+-----+
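
The example above uses the same fraction for every key, so the per-key behaviour is easy to miss. The plain-Scala sketch below assumes each row is kept with the probability assigned to its key and that keys absent from the map get fraction 0. `stratifiedSample` is a hypothetical helper built on `scala.util.Random`, so it will not reproduce Spark's selections:

```scala
import scala.util.Random

// Rows of the example DataFrame as plain (key, value) pairs.
val pairs = Seq((1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 20), (3, 30))

// Hypothetical helper: keep each row with the probability assigned to its key;
// keys missing from `fractions` are treated as fraction 0.
def stratifiedSample(data: Seq[(Int, Int)],
                     fractions: Map[Int, Double],
                     seed: Long): Seq[(Int, Int)] = {
  val rng = new Random(seed)
  data.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
}

// Keep every row with key 1, about half of key 2, and none of key 3.
val strat = stratifiedSample(pairs, Map(1 -> 1.0, 2 -> 0.5), seed = 11L)
strat.foreach(println)
```

Choosing a different fraction per key is what lets a rare stratum be kept in full while a common one is thinned out.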



## Random Data Generation 

- SQL functions to generate columns filled with random values
- Two supported distributions: uniform (`rand`, values in [0, 1)) and standard normal (`randn`)
- Useful for randomized algorithms, prototyping and performance testing

In [9]:

import org.apache.spark.sql.functions.{rand, randn}

val df = spark.range(0, 10)

df.select("id").withColumn("uniform", rand(10L)).
    withColumn("normal", randn(10L)).show()

+---+-------------------+--------------------+
| id|            uniform|              normal|
+---+-------------------+--------------------+
|  0|0.41371264720975787| -0.5877482396744728|
|  1| 0.7311719281896606|  1.5746327759749246|
|  2| 0.9031701155118229|  -2.087434531229601|
|  3|0.09430205113458567|  1.0191385374853092|
|  4|0.38340505276222947|-0.01130602009482...|
|  5| 0.5569246135523511| -1.4651299919940128|
|  6| 0.4977441406613893| -1.1978785320746455|
|  7| 0.2076666106201438|  1.2609253944513816|
|  8| 0.9571919406508957|   0.851458707097714|
|  9| 0.7429395461204413|  1.3459954052313476|
+---+-------------------+--------------------+



df = [id: bigint]


[id: bigint]
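
Values from other ranges or from normal distributions with a different mean and spread can be obtained by shifting and scaling. The plain-Scala sketch below shows the arithmetic with `scala.util.Random`; the helper names `uniformBetween` and `normalWith` and the chosen parameters are assumptions for illustration:

```scala
import scala.util.Random

val rng = new Random(10L)

// Uniform on [low, high): rescale a draw from [0, 1).
def uniformBetween(low: Double, high: Double): Double =
  low + (high - low) * rng.nextDouble()

// Normal with mean mu and standard deviation sigma: rescale a standard normal draw.
def normalWith(mu: Double, sigma: Double): Double =
  mu + sigma * rng.nextGaussian()

val uniforms = Seq.fill(5)(uniformBetween(10.0, 20.0))
val normals = Seq.fill(5)(normalWith(100.0, 15.0))

println(uniforms.mkString(", "))
println(normals.mkString(", "))
```

The same rescaling can be written directly on Spark columns, for example `rand(10L) * 10 + 10` or `randn(10L) * 15 + 100`.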

## Lesson Summary

Having completed this lesson, you should be able to:

- Compute column summary statistics
- Compute pairwise statistics between series/columns
- Perform standard sampling on any `DataFrame`
- Split any `DataFrame` randomly into subsets
- Perform stratified sampling on `DataFrames`
- Generate random data from uniform and normal distributions

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.