Aggregating is the act of collecting values together, and it is a cornerstone of big data analytics.
In an aggregation, you specify a key or grouping and an aggregation function that describes
how to transform one or more columns. This function must produce one result for each
group, given multiple input values. Spark’s aggregation capabilities are sophisticated and mature,
covering a wide variety of use cases.

* Spark allows us to create the following grouping types:
  * The **simplest grouping** is to just summarize a complete DataFrame by performing an aggregation in a select statement.
  * A **group by** allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns.
  * A **window** gives you the ability to specify one or more keys as well as one or more aggregation functions to transform the value columns. However, the rows input to the function are selected relative to the current row.
  * A **grouping set** lets you aggregate at multiple different levels. Grouping sets are available as a primitive in SQL and via rollups and cubes in DataFrames.
  * A **rollup** makes it possible for you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized hierarchically.
  * A **cube** allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized across all combinations of columns.

* The group by, rollup, and cube groupings each return a grouped dataset (GroupedData in PySpark, RelationalGroupedDataset in Scala) on which we specify our aggregations.
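
To make the "one result per group" idea concrete, here is a minimal plain-Python sketch of a keyed aggregation. It does not use Spark, and the rows are hypothetical values invented for illustration:

```python
from collections import defaultdict

# Hypothetical (country, quantity) rows, invented for this illustration.
rows = [("UK", 6), ("UK", 8), ("France", 3), ("France", 12), ("Germany", 5)]

# Fold every value into the running result of its grouping key.
# The aggregation function here is addition, so each key ends with one total.
totals = defaultdict(int)
for country, quantity in rows:
    totals[country] += quantity

print(dict(totals))
```

However many rows share a key, the result holds exactly one value per key, which is the contract every aggregation function in this chapter follows.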

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Chapter 7").getOrCreate()


In [2]:
spark

In [5]:
df = spark.read.format("csv").\
    option("header","true").\
    option("inferSchema","true").\
    load("/home/fm-pc-lt-342/Documents/Spark Docx/spark_practice/datas/retail-data/online-retail-dataset.csv").coalesce(5)

In [6]:
df.cache()
df.createOrReplaceTempView("dfTable")

In [7]:
df.count()

541909

### Aggregation Functions

* ### count

The first function worth going over is count. In this example it acts as a transformation rather than an action. We can do one of two things: count a specific column, or count all rows by using count(*) or count(1), which counts every row as the literal one. The example below counts a specific column:

In [8]:
from pyspark.sql.functions import count

df.select(count("StockCode")).show()

+----------------+
|count(StockCode)|
+----------------+
|          541909|
+----------------+
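
Counting a specific column and counting every row differ when the column contains nulls: Spark's count on a column skips null values, whereas count(*) counts all rows. In the output above the two agree (541909 matches df.count()), which suggests StockCode has no nulls in this dataset. The plain-Python sketch below mimics the difference on a hypothetical column, with None standing in for NULL:

```python
# Hypothetical column values; None stands in for a SQL NULL.
stock_codes = ["85123A", None, "71053", "85123A", None]

# Analogous to count(*): every row is counted.
rows_counted = len(stock_codes)

# Analogous to count("StockCode"): null entries are skipped.
non_null_counted = len([code for code in stock_codes if code is not None])

print(rows_counted, non_null_counted)
```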



* ### count distinct

Sometimes, the total number is not relevant; rather, it’s the number of unique groups that you
want. To get this number, you can use the countDistinct function. This is a bit more relevant
for individual columns:

In [9]:
from pyspark.sql.functions import countDistinct

df.select(countDistinct("StockCode")).show()

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+



* ### approx_count_distinct

Often, we find ourselves working with large datasets and the exact distinct count is irrelevant.
There are times when an approximation to a certain degree of accuracy will work just fine, and
for that, you can use the approx_count_distinct function:

In [10]:
from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct("StockCode",0.1)).show()

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+



You will notice that approx_count_distinct took a second parameter, rsd, with which you can
specify the maximum relative standard deviation allowed for the estimate (the default is 0.05).
In this case, we specified a rather large value and thus receive an answer that is quite far off
but that completes more quickly than countDistinct.
You will see much greater performance gains with larger datasets.
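
To see how far off the estimate is, we can compare the two outputs above. The arithmetic below uses only the numbers printed by the notebook:

```python
exact = 4070   # countDistinct("StockCode"), from the output above
approx = 3364  # approx_count_distinct("StockCode", 0.1), from the output above

# Relative error of the approximation with respect to the exact count.
relative_error = abs(exact - approx) / exact
print(f"{relative_error:.1%}")
```

The observed error is about 17%, which is larger than the 0.1 we passed in. That is consistent with the parameter describing the spread of the estimator rather than a hard bound on any single result.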

* ### first and last

You can get the first and last values from a DataFrame by using these two obviously named
functions. This will be based on the rows in the DataFrame, not on the values in the DataFrame:

In [13]:
from pyspark.sql.functions import first, last
df.select(first("StockCode").alias("firstvalue"), last("StockCode").alias("lastvalue")).show()

+----------+---------+
|firstvalue|lastvalue|
+----------+---------+
|    85123A|    22138|
+----------+---------+
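
Because first and last depend on row position, nulls at either end of the column matter. Both functions accept an ignorenulls flag in pyspark.sql.functions. The sketch below imitates that behaviour in plain Python; first_value and last_value are hypothetical helper names, not Spark APIs, and the column values are invented:

```python
def first_value(values, ignorenulls=False):
    """Return the first element, optionally skipping None (NULL) entries."""
    for value in values:
        if value is not None or not ignorenulls:
            return value
    return None  # empty input, or nothing but nulls when ignorenulls is set


def last_value(values, ignorenulls=False):
    """Return the last element by scanning the values in reverse order."""
    return first_value(list(reversed(values)), ignorenulls)


# Hypothetical column with nulls at both ends.
column = [None, "85123A", "71053", "22138", None]

print(first_value(column), last_value(column))
print(first_value(column, ignorenulls=True), last_value(column, ignorenulls=True))
```

Keep in mind that without an explicit ordering, the row order of a distributed DataFrame is not guaranteed, so the result of first and last can change between runs.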



* ### min and max

To extract the minimum and maximum values from a DataFrame, use the min and max functions:

In [16]:
from pyspark.sql.functions import min, max, col

df.select(min(col("Quantity")),max("Quantity")).show()

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+



* ### sum

Another simple task is to add all the values in a column using the sum function:

In [17]:
from pyspark.sql.functions import sum

df.select(sum("Quantity")).show()

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+



* ### sum distinct

In addition to summing a total, you also can sum a distinct set of values by using the
sumDistinct function:

In [18]:
from pyspark.sql.functions import sumDistinct

df.select(sumDistinct("Quantity")).show() # 29310



+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 29310|
+----------------------+
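
The difference between the two sums is that duplicates are removed before adding. A plain-Python sketch on hypothetical quantities:

```python
import builtins  # builtins.sum avoids a clash if pyspark.sql.functions.sum was imported earlier

# Hypothetical quantities with repeated values.
quantities = [6, 6, 8, 2, 8, -1]

total = builtins.sum(quantities)                # every row contributes
distinct_total = builtins.sum(set(quantities))  # each distinct value contributes once

print(total, distinct_total)
```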



* ### avg

Although you can calculate average by dividing sum by count, Spark provides an easier way to
get that value via the avg or mean functions. In this example, we use alias in order to more
easily reuse these columns later:

In [20]:
from pyspark.sql.functions import sum, count, avg, expr

df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_purchases"),
).selectExpr(
    "total_purchases/total_transactions",
    "avg_purchases",
    "mean_purchases",
).show()

+--------------------------------------+----------------+----------------+
|(total_purchases / total_transactions)|   avg_purchases|  mean_purchases|
+--------------------------------------+----------------+----------------+
|                      9.55224954743324|9.55224954743324|9.55224954743324|
+--------------------------------------+----------------+----------------+



* ### Variance and Standard Deviation

Calculating the mean naturally brings up questions about the variance and standard deviation.
These are both measures of the spread of the data around the mean. The variance is the average
of the squared differences from the mean, and the standard deviation is the square root of the
variance. You can calculate these in Spark by using their respective functions. Note, however,
that Spark provides both the sample and the population formula for each measure. These are
different statistical formulas (the sample version divides by n - 1 rather than n), and we need
to differentiate between them. By default, the variance and stddev functions use the sample
formula. You can also name the sample version explicitly or request the population version:

In [21]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

df.select(
    var_pop("Quantity"), var_samp("Quantity"),
    stddev_pop("Quantity"), stddev_samp("Quantity"),
).show()

+-----------------+------------------+--------------------+---------------------+
|var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+-----------------+------------------+--------------------+---------------------+
| 47559.3036466091|  47559.3914092988|  218.08095663447807|   218.08115785023426|
+-----------------+------------------+--------------------+---------------------+
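
The population and sample versions differ only in the divisor: n for the population, n - 1 for the sample. The sketch below shows this with Python's statistics module on a small hypothetical dataset, then checks that the two variances printed above are related by the factor n / (n - 1), using the row count from df.count():

```python
import math
import statistics

# Small hypothetical dataset: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var_pop_small = statistics.pvariance(data)   # 32 / 8
var_samp_small = statistics.variance(data)   # 32 / 7
std_pop_small = statistics.pstdev(data)      # square root of the population variance

# Values printed by the notebook above.
n = 541909
reported_var_pop = 47559.3036466091
reported_var_samp = 47559.3914092988

# Rescaling the population variance by n / (n - 1) should give the sample variance.
rescaled = reported_var_pop * n / (n - 1)
print(math.isclose(rescaled, reported_var_samp, abs_tol=1e-4))
```

With more than half a million rows the two versions are nearly identical, which is why the outputs above agree to the first decimal place. The difference matters far more for small groups.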



* ### skewness and kurtosis

Skewness and kurtosis are both measurements of extreme points in your data. Skewness
measures the asymmetry of the values in your data around the mean, whereas kurtosis is a
measure of the tail of data. These are both relevant specifically when modeling your data as a
probability distribution of a random variable. Although here we won’t go into the math behind
these specifically, you can look up definitions quite easily on the internet. You can calculate
these by using the functions:

In [22]:
from pyspark.sql.functions import skewness, kurtosis

df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+--------------------+------------------+
|  skewness(Quantity)|kurtosis(Quantity)|
+--------------------+------------------+
|-0.26407557610529564|119768.05495534712|
+--------------------+------------------+
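
As a rough guide to what these numbers mean, the sketch below computes skewness and kurtosis from central moments in plain Python. It assumes the population-moment definitions and reports excess kurtosis (a normal distribution scores 0); that convention is an assumption here, so check the Spark documentation before comparing values exactly. The helper names and datasets are hypothetical:

```python
import math


def central_moment(values, k):
    """k-th central moment: the mean of (value - mean) ** k."""
    mean = math.fsum(values) / len(values)
    return math.fsum((v - mean) ** k for v in values) / len(values)


def skewness_of(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis_of(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]      # balanced around the mean: skewness 0
right_tailed = [1, 1, 1, 2, 10]  # one large value pulls the tail to the right

print(skewness_of(symmetric), excess_kurtosis_of(symmetric))
print(skewness_of(right_tailed))
```

A very large kurtosis, like the one reported for Quantity above, indicates that a few extreme values dominate the tails, which fits the minimum and maximum of -80995 and 80995 seen earlier.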



* ### Covariance and Correlation

We discussed single-column aggregations, but some functions compare the interactions of the
values in two different columns. Among these are corr, for correlation, and covar_samp and
covar_pop, for sample and population covariance. Correlation measures the Pearson correlation
coefficient, which is scaled between –1 and +1. The covariance is scaled according to the inputs
in the data.

In [23]:
from pyspark.sql.functions import corr, covar_pop, covar_samp

df.select(
    corr("InvoiceNo", "Quantity"),
    covar_samp("InvoiceNo", "Quantity"),
    covar_pop("InvoiceNo", "Quantity"),
).show()

+-------------------------+-------------------------------+------------------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|
+-------------------------+-------------------------------+------------------------------+
|     4.912186085642775E-4|             1052.7280543915654|            1052.7260778754612|
+-------------------------+-------------------------------+------------------------------+
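
The relationship between the two measures can be sketched in plain Python: the Pearson coefficient is the covariance divided by the product of the two standard deviations, which removes the units and yields a value between -1 and +1. The helper names and data below are hypothetical:

```python
import math


def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences; sample divides by n - 1."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    total = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1) if sample else total / n


def pearson(xs, ys):
    """Covariance normalised by both standard deviations."""
    cov_xy = covariance(xs, ys, sample=False)
    var_x = covariance(xs, xs, sample=False)
    var_y = covariance(ys, ys, sample=False)
    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical columns where y is exactly twice x.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

print(covariance(x, y, sample=False), covariance(x, y, sample=True), pearson(x, y))
```

Doubling y doubles the covariance but leaves the correlation at 1, which is why correlation is easier to compare across columns with different scales.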



### Aggregating to Complex Types