### PySpark Random Sample with Example
Random sampling returns a random subset of the records in a dataset. It is helpful when you have a large dataset and want to analyze or test a subset of the data, for example 10% of the original file.

For DataFrame:
Syntax: `sample(withReplacement=None, fraction=None, seed=None)`

For RDD:
Syntax: `sample(withReplacement, fraction, seed=None)`


`fraction`: Fraction of rows to generate, range [0.0, 1.0]. Note that the sample is not guaranteed to contain exactly that fraction of the records.

`seed`: Seed for the random number generator; use the same seed to reproduce the same random sample.

`withReplacement`: Whether to sample with replacement. The default is False. If the value is True, some rows may appear more than once in the sample.


#### 1.1 Using fraction to get a random sample in PySpark

By using a fraction between 0 and 1, `sample()` returns approximately that fraction of the dataset. For example, 0.1 returns about 10% of the rows. However, it is not guaranteed to return exactly 10% of the records.
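
Why the count is only approximate can be illustrated without Spark. The sketch below assumes that sampling without replacement behaves like an independent coin flip per row (Bernoulli sampling), where each row is kept with probability `fraction`. `bernoulli_sample` is a hypothetical helper written for this illustration, not a PySpark API.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))

# Sample sizes for several seeds: close to 10 on average, but not fixed at 10.
sizes = [len(bernoulli_sample(rows, 0.1, seed=s)) for s in range(5)]
print(sizes)

# The same seed reproduces the same sample.
first = bernoulli_sample(rows, 0.1, seed=123)
second = bernoulli_sample(rows, 0.1, seed=123)
print(first == second)  # True
```

Because every row is decided independently, the sample size is itself random, which matches the behaviour described above.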

In [0]:
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder \
    .master('local[*]') \
    .appName('SparkByExample.com') \
    .getOrCreate()

In [0]:
df=spark.range(100)
print(df.sample(0.06).collect())

[Row(id=22), Row(id=28), Row(id=38), Row(id=55), Row(id=82), Row(id=92)]


In [0]:

rdd = spark.sparkContext.range(0,100)
print(rdd.sample(False,0.1,0).collect())
print(rdd.sample(True,0.3,123).collect())


[23, 48, 53, 60, 72, 87, 91, 96, 98]
[0, 11, 16, 18, 19, 23, 23, 24, 26, 26, 27, 29, 35, 38, 47, 49, 54, 54, 55, 61, 61, 66, 68, 81, 81, 82, 85, 97, 99]


#### 1.2 Using seed to reproduce the same samples in PySpark

To get the same random sample on every run, use the same seed value. Change the seed value to get a different result.


In [0]:
print(df.sample(0.1,123).collect())
print(df.sample(0.1,123).collect())
print(df.sample(0.1,456).collect())


[Row(id=35), Row(id=38), Row(id=41), Row(id=45), Row(id=71), Row(id=84), Row(id=87), Row(id=99)]
[Row(id=35), Row(id=38), Row(id=41), Row(id=45), Row(id=71), Row(id=84), Row(id=87), Row(id=99)]
[Row(id=22), Row(id=33), Row(id=35), Row(id=41), Row(id=53), Row(id=80), Row(id=83), Row(id=87), Row(id=92)]


#### 1.3 Sample withReplacement (May contain duplicates)


In [0]:
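df.sample(True, 0.3, 123).collect() # with replacement: may contain duplicates (not displayed; only the last expression in a cell is shown)
df.sample(0.3, 123).collect() # without replacement: no duplicates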

Out[7]: [Row(id=0),
 Row(id=4),
 Row(id=12),
 Row(id=15),
 Row(id=19),
 Row(id=21),
 Row(id=23),
 Row(id=24),
 Row(id=25),
 Row(id=28),
 Row(id=29),
 Row(id=34),
 Row(id=35),
 Row(id=36),
 Row(id=38),
 Row(id=41),
 Row(id=45),
 Row(id=47),
 Row(id=50),
 Row(id=52),
 Row(id=59),
 Row(id=63),
 Row(id=65),
 Row(id=71),
 Row(id=82),
 Row(id=84),
 Row(id=87),
 Row(id=94),
 Row(id=99)]


### PySpark fillna() & fill() – Replace NULL/None Values

Syntax:

`fillna(value, subset=None)`

`fill(value, subset=None)`

**value** – The value that replaces NULL/None values. It should be an int, long, float, string, bool, or dict. When a dict is passed, it maps column names to replacement values and `subset` is ignored.

**subset** – Optional. The list of column names in which you want to replace NULL/None values.

In [0]:
#The file we are using here is available at GitHub small_zipcode.csv
# https://github.com/spark-examples/spark-scala-examples/blob/master/src/main/resources/small_zipcode.csv


filepath='/FileStore/tables/small_zipcode.csv'
df=spark.read.csv(filepath, header=True, inferSchema=True)
df.printSchema()
df.show(truncate=False)


root
 |-- id: integer (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |null      |
+---+-------+--------+-------------------+-----+----------+



In [0]:

#Replace null with 0 in all integer columns
df.na.fill(value=0).show()
#Replace null with 0 only in the population column
df.na.fill(value=0, subset=['population'])


+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|         0|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|         0|
+---+-------+--------+-------------------+-----+----------+

Out[12]: DataFrame[id: int, zipcode: int, type: string, city: string, state: string, population: int]

#### PySpark Replace Null/None Value with Empty String


In [0]:
df.na.fill(value='').show()

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|                   |   PR|     30100|
|  2|    704|        |PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|        |       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|                   |   TX|      null|
+---+-------+--------+-------------------+-----+----------+



In [0]:
#Now let's replace NULLs in specific columns. The example below replaces nulls in column type with an empty string, in column city with "unknown", and in column population with 0.

df.na.fill('unknown',['city']) \
    .na.fill("", ['type']) \
    .na.fill(0,['population']).show() 

#Alternatively you can also write the above statement as

df.na.fill({'city':'unknown', 'type':"", 'population':0}).show()

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|            unknown|   PR|     30100|
|  2|    704|        |PASEO COSTA DEL SUR|   PR|         0|
|  3|    709|        |       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|            unknown|   TX|         0|
+---+-------+--------+-------------------+-----+----------+

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|            unknown|   PR|     30100|
|  2|    704|        |PASEO COSTA DEL SUR|   PR|         0|
|  3|    709|        |       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|            unknown|   TX|         0|
+---+-------+--------+-------------------+-----+----------+

### PySpark Pivot and Unpivot DataFrame

In [0]:
data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"), \
      ("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"), \
      ("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"), \
      ("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")]

columns= ["Product","Amount","Country"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

root
 |-- Product: string (nullable = true)
 |-- Amount: long (nullable = true)
 |-- Country: string (nullable = true)

+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
|Banana |1000  |USA    |
|Carrots|1500  |USA    |
|Beans  |1600  |USA    |
|Orange |2000  |USA    |
|Orange |2000  |USA    |
|Banana |400   |China  |
|Carrots|1200  |China  |
|Beans  |1500  |China  |
|Orange |4000  |China  |
|Banana |2000  |Canada |
|Carrots|2000  |Canada |
|Beans  |2000  |Mexico |
+-------+------+-------+



In [0]:
df.groupBy('Product').pivot('Country').sum('Amount').show()

+-------+------+-----+------+----+
|Product|Canada|China|Mexico| USA|
+-------+------+-----+------+----+
| Orange|  null| 4000|  null|4000|
|  Beans|  null| 1500|  2000|1600|
| Banana|  2000|  400|  null|1000|
|Carrots|  2000| 1200|  null|1500|
+-------+------+-----+------+----+



In [0]:
df.groupBy('Country').pivot('Product').sum('Amount').show()

# pivot is an expensive operation, so it is recommended to pass the distinct values of the pivot column (if known) as an argument, as shown below.
countries = ["USA","China","Canada","Mexico"]
df.groupBy('Product').pivot('Country', countries).sum('Amount').show()

+-------+------+-----+-------+------+
|Country|Banana|Beans|Carrots|Orange|
+-------+------+-----+-------+------+
|  China|   400| 1500|   1200|  4000|
|    USA|  1000| 1600|   1500|  4000|
| Mexico|  null| 2000|   null|  null|
| Canada|  2000| null|   2000|  null|
+-------+------+-----+-------+------+

+-------+----+-----+------+------+
|Product| USA|China|Canada|Mexico|
+-------+----+-----+------+------+
| Orange|4000| 4000|  null|  null|
|  Beans|1600| 1500|  null|  2000|
| Banana|1000|  400|  2000|  null|
|Carrots|1500| 1200|  2000|  null|
+-------+----+-----+------+------+



In [0]:

pivotDF = df.groupBy("Product","Country") \
      .sum("Amount") \
      .groupBy("Product") \
      .pivot("Country") \
      .sum("sum(Amount)")
pivotDF.show(truncate=False)


+-------+------+-----+------+----+
|Product|Canada|China|Mexico|USA |
+-------+------+-----+------+----+
|Orange |null  |4000 |null  |4000|
|Beans  |null  |1500 |2000  |1600|
|Banana |2000  |400  |null  |1000|
|Carrots|2000  |1200 |null  |1500|
+-------+------+-----+------+----+



#### Unpivot PySpark DataFrame

PySpark SQL did not have an unpivot function before Spark 3.4 (which added `DataFrame.unpivot()`), so this example uses the **stack()** function.



In [0]:
from pyspark.sql.functions import expr
unpivotexpr = "stack(3, 'Canada', Canada,'China', China,'Mexico', Mexico) as (Country, Total)"
unpivotDF = pivotDF.select('Product', expr(unpivotexpr)).where('Total is not null')
unpivotDF.show()

+-------+-------+-----+
|Product|Country|Total|
+-------+-------+-----+
| Orange|  China| 4000|
|  Beans|  China| 1500|
|  Beans| Mexico| 2000|
| Banana| Canada| 2000|
| Banana|  China|  400|
|Carrots| Canada| 2000|
|Carrots|  China| 1200|
+-------+-------+-----+




### PySpark partitionBy() – Write to Disk Example

https://sparkbyexamples.com/pyspark/pyspark-partitionby-example/

### PySpark MapType (Dict) Usage with Examples


In [0]:
from pyspark.sql.types import StringType, MapType
mapCol = MapType(StringType(), StringType(), False)




- The first param `keyType` is used to specify the type of the key in the map.
- The second param `valueType` is used to specify the type of the value in the map.
- The third param `valueContainsNull` is an optional boolean that specifies whether the map values can be Null/None. It defaults to True.


In [0]:
from pyspark.sql.types import StructType, StructField, StringType, MapType
schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(), StringType()),True)
])

dataDictionary = [
        ('James',{'hair':'black','eye':'brown'}),
        ('Michael',{'hair':'brown','eye':None}),
        ('Robert',{'hair':'red','eye':'black'}),
        ('Washington',{'hair':'grey','eye':'grey'}),
        ('Jefferson',{'hair':'brown','eye':''})
        ]

df = spark.createDataFrame(data=dataDictionary, schema=schema)

df.printSchema()
df.show(truncate=False)


root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |{eye -> brown, hair -> black}|
|Michael   |{eye -> null, hair -> brown} |
|Robert    |{eye -> black, hair -> red}  |
|Washington|{eye -> grey, hair -> grey}  |
|Jefferson |{eye -> , hair -> brown}     |
+----------+-----------------------------+


#### Access PySpark MapType Elements


In [0]:
df3=df.rdd.map(lambda x: (x.name, x.properties['hair'], x.properties['eye'])).toDF(['name','hair','eye'])
df3.show()

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+


In [0]:
#Let’s use another way to get the value of a key from Map using getItem() of Column type, this method takes a key as an argument and returns a value.

df.withColumn('hair', df.properties.getItem('hair')) \
    .withColumn('eye', df.properties.getItem('eye')) \
    .drop('properties') \
    .show()

df.withColumn('hair', df.properties['hair']) \
    .withColumn('eye', df.properties['eye']) \
    .drop('properties') \
    .show()

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+



In [0]:
#Below are some of the MapType Functions with examples.

from pyspark.sql.functions import explode
df.select('name', explode('properties')).show()


+----------+----+-----+
|      name| key|value|
+----------+----+-----+
|     James| eye|brown|
|     James|hair|black|
|   Michael| eye| null|
|   Michael|hair|brown|
|    Robert| eye|black|
|    Robert|hair|  red|
|Washington| eye| grey|
|Washington|hair| grey|
| Jefferson| eye|     |
| Jefferson|hair|brown|
+----------+----+-----+

