In [0]:
dbutils.fs.cp(
  'dbfs:/FileStore/Spark_The_Definitive_Guide_master.zip', 
  'file:/tmp/Spark_The_Definitive_Guide_master.zip'
)

Out[1]: True

In [0]:
%sh unzip /tmp/Spark_The_Definitive_Guide_master.zip -d /tmp > /tmp/unzipout 2>&1

Spark also allows us to create the following grouping types:
1) The simplest grouping is to just summarize a complete DataFrame by performing an aggregation in a select statement.
2) A “group by” allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns.
3) A “window” gives you the ability to specify one or more keys as well as one or more aggregation functions to transform the value columns. However, the rows input to the function are somehow related to the current row.
4) A “grouping set,” which you can use to aggregate at multiple different levels. Grouping sets are available as a primitive in SQL and via rollups and cubes in DataFrames.
5) A “rollup” makes it possible for you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized hierarchically.
6) A “cube” allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized across all combinations of columns.

Each grouping returns a RelationalGroupedDataset (exposed as GroupedData in PySpark) on which we specify our aggregations.
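To make the difference between a rollup and a cube concrete, here is a minimal plain-Python sketch (no Spark required) that lists the grouping sets each one would aggregate over. The helper names `rollup_sets` and `cube_sets` and the key names are made up for illustration; they are not Spark APIs.

```python
from itertools import combinations

def rollup_sets(keys):
    """Hierarchical grouping sets: every prefix of the keys, ending with the grand total ()."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]

def cube_sets(keys):
    """All combinations of the keys, ending with the grand total ()."""
    return [combo
            for size in range(len(keys), -1, -1)
            for combo in combinations(keys, size)]

keys = ["Date", "Country"]  # hypothetical grouping keys
print(rollup_sets(keys))  # [('Date', 'Country'), ('Date',), ()]
print(cube_sets(keys))    # [('Date', 'Country'), ('Date',), ('Country',), ()]
```

The rollup never produces the `('Country',)` level on its own, which is what "summarized hierarchically" means in practice; the cube does.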

In [0]:
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("file:///tmp/Spark-The-Definitive-Guide-master/data/retail-data/all/*.csv")\
.coalesce(5)
df.cache()
df.createOrReplaceTempView("dfTable")
df.show(5, truncate=False)

+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate   |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |12/1/2010 8:26|2.55     |17850     |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |12/1/2010 8:26|2.75     |17850     |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
+---------+---------+-----------------------------------

The basic aggregation applies to the entire DataFrame

In [0]:
print (df.count())

541909


Most of the aggregation functions are available in pyspark.sql.functions

Count:



In [0]:
from pyspark.sql.functions import count
df.select(count(df.Quantity)).show()

+---------------+
|count(Quantity)|
+---------------+
|         541909|
+---------------+



There are a number of gotchas when it comes to null values and counting. For instance, when performing a count(*), Spark will count null values (including rows containing all nulls). However, when counting an individual column, Spark will not count the null values.
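The null-counting rule above can be illustrated without Spark. This is only a plain-Python sketch of the semantics, with `None` standing in for a SQL null and made-up rows:

```python
# Made-up rows; None plays the role of a SQL NULL.
rows = [
    {"Quantity": 6, "CustomerID": 17850},
    {"Quantity": 8, "CustomerID": None},
    {"Quantity": None, "CustomerID": None},  # a row containing only nulls
]

# count(*) counts every row, including the all-null one.
count_star = len(rows)

# count(column) skips rows where that column is null.
count_customer = sum(1 for r in rows if r["CustomerID"] is not None)
count_quantity = sum(1 for r in rows if r["Quantity"] is not None)

print(count_star, count_customer, count_quantity)  # 3 1 2
```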

countDistinct

In [0]:
from pyspark.sql.functions import countDistinct
df.select(countDistinct(df.StockCode)).show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+



Often, we find ourselves working with large datasets where the exact distinct count is irrelevant. There are times when an approximation to a certain degree of accuracy will work just fine, and for that, you can use the approx_count_distinct function:

In [0]:
from pyspark.sql.functions import approx_count_distinct
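# The second argument (rsd) is the maximum relative standard deviation allowed;
# a larger value is cheaper to compute but less accurate.
df.select(approx_count_distinct(df.StockCode, 0.1)).show()
df.select(approx_count_distinct(df.StockCode, 0.2)).show()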

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            2944|
+--------------------------------+



first and last

In [0]:
from pyspark.sql.functions import first, last
df.select(first(df.StockCode), last(df.StockCode)).show()

+----------------+---------------+
|first(StockCode)|last(StockCode)|
+----------------+---------------+
|          85123A|          22138|
+----------------+---------------+



min and max

In [0]:
from pyspark.sql.functions import min, max
df.select(min(df.UnitPrice), max(df.UnitPrice)).show()

+--------------+--------------+
|min(UnitPrice)|max(UnitPrice)|
+--------------+--------------+
|     -11062.06|       38970.0|
+--------------+--------------+



In [0]:
# Sum
from pyspark.sql.functions import sum
df.select(sum(df.Quantity)).show()

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+



In [0]:
# sumDistinct (deprecated since Spark 3.2 in favor of sum_distinct)
from pyspark.sql.functions import sumDistinct
df.select(sumDistinct(df.Quantity)).show()



+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 29310|
+----------------------+



In [0]:
#avg
from pyspark.sql.functions import avg, count, sum, expr
df.select(avg(df.UnitPrice)).show()

df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_purchases")
) \
.selectExpr(
    "total_purchases/total_transactions",
    "avg_purchases",
    "mean_purchases"
).show()

+-----------------+
|   avg(UnitPrice)|
+-----------------+
|4.611113626089214|
+-----------------+

+--------------------------------------+----------------+----------------+
|(total_purchases / total_transactions)|   avg_purchases|  mean_purchases|
+--------------------------------------+----------------+----------------+
|                      9.55224954743324|9.55224954743324|9.55224954743324|
+--------------------------------------+----------------+----------------+



In [0]:
# Variance and Standard Deviation
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

df.select(var_pop("Quantity"), var_samp("Quantity"),
stddev_pop("Quantity"), stddev_samp("Quantity")).show()

+------------------+------------------+--------------------+---------------------+
| var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+------------------+------------------+--------------------+---------------------+
|47559.303646609056|47559.391409298754|  218.08095663447796|   218.08115785023418|
+------------------+------------------+--------------------+---------------------+



Skewness and kurtosis are both measurements of extreme points in your data. Skewness measures the asymmetry of the values in your data around the mean, whereas kurtosis measures how heavy the tails of the data are. Both are relevant specifically when modeling your data as a probability distribution of a random variable.

In [0]:
# skewness and kurtosis
from pyspark.sql.functions import skewness, kurtosis
df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+-------------------+------------------+
| skewness(Quantity)|kurtosis(Quantity)|
+-------------------+------------------+
|-0.2640755761052562|119768.05495536952|
+-------------------+------------------+
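To show what skewness and kurtosis measure, here is a small plain-Python sketch based on central moments. It assumes population (not sample) formulas and reports excess kurtosis (a normal distribution scores 0), which I believe matches Spark's built-ins but should be treated as an assumption. The function names and data are made up.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Asymmetry around the mean: third moment scaled by variance^1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Tail weight: fourth moment scaled by variance^2, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 50]  # one extreme value on the right

print(skewness(symmetric))         # 0.0, perfectly symmetric
print(skewness(right_tailed) > 0)  # True, the long tail is on the right
print(excess_kurtosis(right_tailed) > excess_kurtosis(symmetric))  # True
```

A single extreme point is enough to move both statistics, which is why the Quantity column above shows such a large kurtosis.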



We discussed single-column aggregations, but some functions compare the interactions of the values in two different columns. Two of these functions are cov and corr, for covariance and correlation, respectively. Correlation is measured as the Pearson correlation coefficient, which is scaled between –1 and +1. The covariance is scaled according to the inputs in the data.

As with variance, covariance can be calculated either as the sample covariance or the population covariance, so it can be important to specify which formula you want to use. Correlation has no notion of this and therefore does not have separate calculations for population and sample.

In [0]:
from pyspark.sql.functions import corr, covar_pop, covar_samp
df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
covar_pop("InvoiceNo", "Quantity")).show()

+-------------------------+-------------------------------+------------------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|
+-------------------------+-------------------------------+------------------------------+
|     4.912186085635685E-4|             1052.7280543902734|            1052.7260778741693|
+-------------------------+-------------------------------+------------------------------+
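The sample/population distinction for covariance, and the fact that correlation is unaffected by it and by scale, can be sketched in plain Python. The helper names `covariance` and `pearson` and the data are made up for illustration.

```python
from math import sqrt

def covariance(xs, ys, sample=True):
    """Sample covariance divides by n - 1; population covariance divides by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys, sample=False) / sqrt(
        covariance(xs, xs, sample=False) * covariance(ys, ys, sample=False)
    )

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
ys_scaled = [y * 10 for y in ys]

print(covariance(xs, ys, sample=False))  # 2.5
print(covariance(xs, ys, sample=True))   # divides by n - 1, so slightly larger
print(covariance(xs, ys_scaled, sample=False))  # 25.0, scales with the inputs
print(pearson(xs, ys), pearson(xs, ys_scaled))  # both 1.0
```

The n versus n - 1 divisor cancels out in the correlation ratio, which is why there is only one corr function.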



Aggregating to Complex Types<br>
We can collect a column's values into a list, or into a set of its unique values.

In [0]:
from pyspark.sql.functions import collect_set, collect_list
df.agg(collect_set("Country"), collect_list("Country")).show()
df.agg(collect_set("Country"), collect_list("Country")).printSchema()
# Note: collect_set also returns an array column, not a set type

+--------------------+---------------------+
|collect_set(Country)|collect_list(Country)|
+--------------------+---------------------+
|[Portugal, Italy,...| [United Kingdom, ...|
+--------------------+---------------------+

root
 |-- collect_set(Country): array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- collect_list(Country): array (nullable = false)
 |    |-- element: string (containsNull = false)



Grouping Operations<br>

Grouping can be done on one or more columns.<br>
groupBy is lazily evaluated and returns a RelationalGroupedDataset (GroupedData in PySpark).<br>
On the grouped dataset we then specify aggregations, which return a DataFrame; these aggregations are also lazily evaluated.


In [0]:
df.groupBy("InvoiceNo", "CustomerId").count().show()

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
|   538800|     16458|   10|
|   538942|     17346|   12|
|  C539947|     13854|    1|
|   540096|     13253|   16|
|   540530|     14755|   27|
|   541225|     14099|   19|
|   541978|     13551|    4|
|   542093|     17677|   16|
|   536596|      null|    6|
|   537252|      null|    1|
|   538041|      null|    1|
|   537159|     14527|   28|
|   537213|     12748|    6|
|   538191|     15061|   16|
|  C539301|     13496|    1|
+---------+----------+-----+
only showing top 20 rows
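As a rough mental model of the grouped count above, here is a plain-Python sketch with made-up rows. Each row lands in exactly one group, keyed by the grouping columns, and rows with a null key (here `None`) are grouped together, as the null CustomerId rows in the output suggest.

```python
from collections import defaultdict

# Made-up (InvoiceNo, CustomerId, Quantity) rows.
rows = [
    ("536365", 17850, 6),
    ("536365", 17850, 8),
    ("536366", 17850, 6),
    ("536596", None, 2),
    ("536596", None, 1),
]

# Step 1: assign every row to exactly one group, keyed by the grouping columns.
groups = defaultdict(list)
for invoice_no, customer_id, quantity in rows:
    groups[(invoice_no, customer_id)].append(quantity)

# Step 2: apply an aggregation to each group.
counts = {key: len(quantities) for key, quantities in groups.items()}

for key, n in counts.items():
    print(key, n)
```

The two-step shape mirrors the API: the grouping step on its own produces no result, and only the aggregation turns the groups back into rows.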



In [0]:
# Grouping with Expression
from pyspark.sql.functions import count, expr
df.groupBy("InvoiceNo").agg(
  count("Quantity").alias("quan"),
  expr("count(Quantity)")
).show()


+---------+----+---------------+
|InvoiceNo|quan|count(Quantity)|
+---------+----+---------------+
|   536596|   6|              6|
|   536938|  14|             14|
|   537252|   1|              1|
|   537691|  20|             20|
|   538041|   1|              1|
|   538184|  26|             26|
|   538517|  53|             53|
|   538879|  19|             19|
|   539275|   6|              6|
|   539630|  12|             12|
|   540499|  24|             24|
|   540540|  22|             22|
|  C540850|   1|              1|
|   540976|  48|             48|
|   541432|   4|              4|
|   541518| 101|            101|
|   541783|  35|             35|
|   542026|   9|              9|
|   542375|   6|              6|
|   536597|  28|             28|
+---------+----+---------------+
only showing top 20 rows



In [0]:
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
.show()

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536596|               1.5|  1.1180339887498947|
|   536938|33.142857142857146|  20.698023172885524|
|   537252|              31.0|                 0.0|
|   537691|              8.15|   5.597097462078001|
|   538041|              30.0|                 0.0|
|   538184|12.076923076923077|   8.142590198943392|
|   538517|3.0377358490566038|  2.3946659604837897|
|   538879|21.157894736842106|  11.811070444356483|
|   539275|              26.0|  12.806248474865697|
|   539630|20.333333333333332|  10.225241100118645|
|   540499|              3.75|  2.6653642652865788|
|   540540|2.1363636363636362|  1.0572457590557278|
|  C540850|              -1.0|                 0.0|
|   540976|10.520833333333334|   6.496760677872902|
|   541432|             12.25|  10.825317547305483|
|   541518| 23.10891089108911|  20.550782784878713|
|   541783|1

Window Functions<br><br>

Window functions carry out an aggregation over a specific window of data, which we define relative to the current row. The window specification determines which rows are passed in to the function. <br>

The difference between group-by and window functions
=> A group-by takes data, and every row can go into only one grouping, whereas a window function calculates a return value for every input row of a table based on a group of rows called a frame. Each row can fall into one or more frames. For example, in a seven-day rolling average where each row represents one day, each row ends up in seven different frames.

Spark supports three kinds of window functions:
1) Ranking functions
2) Analytic functions
3) Aggregate functions
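The rolling-average case mentioned above can be sketched in plain Python to show how one row participates in several frames. This assumes a seven-row trailing frame (the current row plus the six before it) and made-up daily values; `frame_indices` is a hypothetical helper, not a Spark function.

```python
WINDOW = 7
values = list(range(1, 15))  # 14 made-up daily totals: 1, 2, ..., 14

def frame_indices(i):
    """Rows in the frame for row i: up to six preceding rows plus the current row."""
    return range(max(0, i - WINDOW + 1), i + 1)

# One output value per input row, unlike a group-by.
rolling_avg = [
    sum(values[j] for j in frame_indices(i)) / len(frame_indices(i))
    for i in range(len(values))
]

# Count how many frames each row participates in.
membership = [0] * len(values)
for i in range(len(values)):
    for j in frame_indices(i):
        membership[j] += 1

print(rolling_avg[6])  # 4.0, the average of 1..7
print(membership)      # [7, 7, 7, 7, 7, 7, 7, 7, 6, 5, 4, 3, 2, 1]
```

Rows near the end of the data appear in fewer than seven frames only because there are no later rows whose frames would include them.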





In [0]:
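from pyspark.sql.functions import to_date
df.printSchema()
# The "MM" pattern with single-digit months may not parse under the Spark 3+
# default parser, so enable the legacy parser before the conversion is evaluated.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
dfWithDate = df.withColumn("InvoiceDate", to_date(df["InvoiceDate"], "MM/d/yyyy H:mm"))
dfWithDate.show()
dfWithDate.printSchema()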

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6| 2010-12-01|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6| 2010-12-01|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8| 2010-12-01|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6| 2010-12-01

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, col, max
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
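# For each customer and day, order purchases by quantity (largest first);
# the frame runs from the start of the partition up to the current row.
dateWindow = Window \
    .partitionBy("CustomerID", "InvoiceDate")\
    .orderBy(desc("Quantity"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Running maximum of Quantity over that frame.
maxPurchaseQuantity = max(col("Quantity")).over(dateWindow)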
print(maxPurchaseQuantity)
# print (type(maxPurchaseQuantity))
dfWithDate.withColumn("maxPurchaseQuantity", maxPurchaseQuantity).where(dfWithDate.CustomerID.isNotNull()).show()


Column<'max(Quantity) OVER (PARTITION BY CustomerID, InvoiceDate ORDER BY Quantity DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)'>
+---------+---------+--------------------+--------+-----------+---------+----------+-------+-------------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|maxPurchaseQuantity|
+---------+---------+--------------------+--------+-----------+---------+----------+-------+-------------------+
|   562032|   84558A|3D DOG PICTURE PL...|      36| 2011-08-02|     2.95|     12347|Iceland|                 36|
|   562032|    23308|SET OF 60 VINTAGE...|      24| 2011-08-02|     0.55|     12347|Iceland|                 36|
|   562032|    84992|72 SWEETHEART FAI...|      24| 2011-08-02|     0.55|     12347|Iceland|                 36|
|   562032|    84991|60 TEATIME FAIRY ...|      24| 2011-08-02|     0.55|     12347|Iceland|                 36|
|   562032|    21975|PACK OF 60 DINOSA...|      24| 2011-0