# Chapter 7. Aggregate Functions

In [1]:
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("file:///home/ubuntu/ybigta/Dataset_spark/data/retail-data/all/*.csv")\
    .coalesce(5)  
# coalesce(5) reduces the DataFrame to 5 partitions; it does not filter out null values
    
df.cache()       # cache for faster repeated access
df.createOrReplaceTempView("dfTable")
# print the schema
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



- Most aggregations are available as functions (`pyspark.sql.functions`); some are also reachable through the DataFrame's stat attribute

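## 7.1 > Aggregate functions

- count : total number of records (non-null values when given a column)
- countDistinct : number of distinct values
- approx_count_distinct : approximate number of distinct values
- first / last : first / last value in the DataFrame
- min / max : minimum / maximum
- sum : sum
- sumDistinct : sum of the distinct values
- avg : average
- variance and standard deviation : variance / stddev / var_pop / stddev_pop
- skewness and kurtosis : skewness / kurtosis
- covariance and correlation : covar_samp / covar_pop / corr
- aggregating to complex types

- count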

In [2]:
from pyspark.sql.functions import count, countDistinct, approx_count_distinct

df.select(count("StockCode")).show()

df.select(countDistinct("StockCode")).show()

df.select(approx_count_distinct("StockCode", 0.1)).show()

+----------------+
|count(StockCode)|
+----------------+
|          541909|
+----------------+

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+
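The three counts above measure different things. As a rough plain-Python illustration (no Spark involved), using a made-up `stock_codes` list that is not taken from the dataset: counting a column skips nulls, while a distinct count also collapses repeated values.

```python
# Hypothetical column values for illustration; None stands in for a SQL null.
stock_codes = ["85123A", "71053", "85123A", None, "22752", "71053"]

# count(col): number of non-null values in the column.
non_null_count = sum(1 for code in stock_codes if code is not None)

# countDistinct(col): number of distinct non-null values.
distinct_count = len({code for code in stock_codes if code is not None})

print(non_null_count, distinct_count)
```

In the cell above, the second argument of `approx_count_distinct` (0.1) is the maximum relative standard deviation allowed, so the estimate (3364) can differ noticeably from the exact distinct count (4070) in exchange for cheaper computation.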



- Other aggregates: first / last, min / max, sum / sumDistinct

In [3]:
from pyspark.sql.functions import first, last, min, max, sum, sumDistinct

df.select(first("StockCode"), last("StockCode")).show()

df.select(min("Quantity"), max("Quantity")).show()

df.select(sum("Quantity"), sumDistinct("Quantity")).show()

+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
|                 85123A|                 22138|
+-----------------------+----------------------+

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+

+-------------+----------------------+
|sum(Quantity)|sum(DISTINCT Quantity)|
+-------------+----------------------+
|      5176450|                 29310|
+-------------+----------------------+



- Average

In [4]:
from pyspark.sql.functions import sum, count, avg, expr

df.select(
count("Quantity").alias("total_transaction"),
sum("Quantity").alias("total_purchases"),
avg("Quantity").alias("avg_purchases"),
expr("mean(Quantity)").alias("mean_purchases"))\
.selectExpr(
"total_purchases/total_transaction as `sum/count`",
"avg_purchases",
"mean_purchases").show()

+----------------+----------------+----------------+
|       sum/count|   avg_purchases|  mean_purchases|
+----------------+----------------+----------------+
|9.55224954743324|9.55224954743324|9.55224954743324|
+----------------+----------------+----------------+



- Variance and standard deviation

In [5]:
from pyspark.sql.functions import var_pop, stddev_pop          # population variance / standard deviation
from pyspark.sql.functions import var_samp, stddev_samp        # sample variance / standard deviation

df.select(var_pop("Quantity"), var_samp("Quantity"), stddev_pop("Quantity"), stddev_samp("Quantity")).show()

+------------------+------------------+--------------------+---------------------+
| var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+------------------+------------------+--------------------+---------------------+
|47559.303646609354| 47559.39140929905|  218.08095663447864|   218.08115785023486|
+------------------+------------------+--------------------+---------------------+
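The population and sample versions differ only in the divisor (n versus n - 1). A minimal sketch with Python's standard `statistics` module, using made-up `quantities` rather than the retail data:

```python
import statistics

# Hypothetical values for illustration only.
quantities = [6.0, 6.0, 8.0, 6.0, 6.0, 2.0, 6.0]
n = len(quantities)

var_pop = statistics.pvariance(quantities)    # squared deviations divided by n
var_samp = statistics.variance(quantities)    # squared deviations divided by n - 1
stddev_pop = statistics.pstdev(quantities)    # square root of var_pop
stddev_samp = statistics.stdev(quantities)    # square root of var_samp

# The two variances are related by the factor n / (n - 1).
print(var_pop, var_samp, var_pop * n / (n - 1))
```

With 541,909 rows the factor n / (n - 1) is very close to 1, which is why the population and sample values in the output above agree to several digits.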



- Skewness and kurtosis

In [6]:
from pyspark.sql.functions import skewness, kurtosis

df.select(skewness("Quantity"), kurtosis("Quantity")).show()

+--------------------+------------------+
|  skewness(Quantity)|kurtosis(Quantity)|
+--------------------+------------------+
|-0.26407557610527843|119768.05495536518|
+--------------------+------------------+
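Skewness measures asymmetry around the mean and kurtosis measures how heavy the tails are. The sketch below uses the usual population-moment definitions in plain Python. It assumes that the reported kurtosis is excess kurtosis (the fourth standardized moment minus 3), which should be checked against the Spark version in use. The helper names and sample lists are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    # Third central moment scaled by the variance to the power 3/2.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth central moment scaled by the squared variance, minus 3
    # so that a normal distribution scores 0 (assumed convention).
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]        # balanced around its mean
right_tailed = [1.0, 1.0, 1.0, 1.0, 10.0]    # one large value on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

A very large kurtosis such as the one in the output above points to a few extreme quantities, which is consistent with the min and max of -80995 and 80995 shown earlier.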



- Covariance and correlation

In [7]:
from pyspark.sql.functions import corr, covar_pop, covar_samp

df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"), 
         covar_pop("InvoiceNo", "Quantity")).show()

+-------------------------+-------------------------------+------------------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|
+-------------------------+-------------------------------+------------------------------+
|     4.912186085637639E-4|             1052.7280543913773|            1052.7260778752732|
+-------------------------+-------------------------------+------------------------------+
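Covariance depends on the scale of the inputs, while correlation is the covariance normalized to the range -1 to 1. A minimal plain-Python sketch of the three quantities, with helper names chosen to mirror the Spark functions and made-up inputs:

```python
import math

def covar_pop(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    """Sample covariance: same sum, divided by n - 1 instead of n."""
    n = len(xs)
    return covar_pop(xs, ys) * n / (n - 1)

def corr(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covar_pop(xs, ys) / math.sqrt(covar_pop(xs, xs) * covar_pop(ys, ys))

# Hypothetical, perfectly linear data: ys is exactly 2 * xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covar_pop(xs, ys), covar_samp(xs, ys), corr(xs, ys))
```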



- Aggregating to complex types

In [8]:
from pyspark.sql.functions import collect_set, collect_list

df.agg(collect_set("Country"), collect_list("Country")).show()

+--------------------+---------------------+
|collect_set(Country)|collect_list(Country)|
+--------------------+---------------------+
|[Portugal, Italy,...| [United Kingdom, ...|
+--------------------+---------------------+
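The difference between the two collectors is only whether duplicates survive. A small plain-Python analogy with a made-up `countries` list:

```python
# Hypothetical column values for illustration.
countries = ["United Kingdom", "France", "United Kingdom", "Germany", "France"]

# collect_list: every value is kept, duplicates included.
collected_list = list(countries)

# collect_set: only distinct values are kept; no ordering should be relied on.
collected_set = set(countries)

print(len(collected_list), len(collected_set))
```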



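## 7.2 > Grouping

- Grouping is a two-step operation
    1. Group by one or more columns : returns a RelationalGroupedDataset (`GroupedData` in PySpark)
    2. Run an aggregation : returns a DataFrame

- Grouping with groupBy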

In [9]:
df.groupBy("InvoiceNo", "CustomerId").count().show()

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
|   538800|     16458|   10|
|   538942|     17346|   12|
|  C539947|     13854|    1|
|   540096|     13253|   16|
|   540530|     14755|   27|
|   541225|     14099|   19|
|   541978|     13551|    4|
|   542093|     17677|   16|
|   543188|     12567|   63|
|   543590|     17377|   19|
|  C543757|     13115|    1|
|  C544318|     12989|    1|
|   544578|     12365|    1|
|   545165|     16339|   20|
|   545289|     14732|   30|
+---------+----------+-----+
only showing top 20 rows



Prefer the agg method: it lets you specify several aggregations at once and accepts expressions.

In [11]:
from pyspark.sql.functions import count, expr

df.groupBy("InvoiceNo").agg(count("Quantity").alias("quan"), expr("count(Quantity)")).show(2)

+---------+----+---------------+
|InvoiceNo|quan|count(Quantity)|
+---------+----+---------------+
|   536596|   6|              6|
|   536938|  14|             14|
+---------+----+---------------+
only showing top 2 rows



- Grouping with maps (the cell below expresses the same aggregations with `expr`)

In [12]:
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"), expr("stddev_pop(Quantity)")).show(2)

+---------+------------------+--------------------+
|InvoiceNo|     avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
|   536596|               1.5|  1.1180339887498947|
|   536938|33.142857142857146|  20.698023172885524|
+---------+------------------+--------------------+
only showing top 2 rows
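The two steps of grouping (build the groups, then aggregate each one) can be sketched in plain Python. The invoice numbers are borrowed from the output above, but the quantities are made up, so the results are not expected to match it.

```python
from collections import defaultdict
import statistics

# Hypothetical (InvoiceNo, Quantity) rows.
rows = [("536596", 1), ("536596", 1), ("536596", 4),
        ("536938", 10), ("536938", 50)]

# Step 1: group rows by key (the analogue of groupBy).
groups = defaultdict(list)
for invoice_no, quantity in rows:
    groups[invoice_no].append(quantity)

# Step 2: compute several aggregations per group in one pass (the analogue of agg).
result = {
    invoice_no: {
        "count": len(quantities),
        "avg": sum(quantities) / len(quantities),
        "stddev_pop": statistics.pstdev(quantities),
    }
    for invoice_no, quantities in groups.items()
}

for invoice_no, aggregates in result.items():
    print(invoice_no, aggregates)
```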



## 7.3 > Window functions

In [33]:
from pyspark.sql.functions import col, to_date

dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy H:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")

- Creating a window specification

In [34]:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

# partitionBy defines the groups (partitions) that the window operates on
windowSpec = Window.partitionBy("CustomerId", "date")\
                .orderBy(desc("Quantity"))\
                .rowsBetween(Window.unboundedPreceding, Window.currentRow)

- Applying window functions

In [36]:
from pyspark.sql.functions import dense_rank, rank, max

maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)

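# dense_rank : tied rows share a rank, and the next rank follows with no gap
purchaseDenseRank = dense_rank().over(windowSpec)

# rank : tied rows share a rank, and the following ranks are skipped by the number of tied rows
purchaseRank = rank().over(windowSpec)

- Applying a window function returns a column (expression) <br>
        -> so it can be used in a DataFrame select statement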
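To make the difference between `rank` and `dense_rank` concrete, here is a plain-Python sketch for a single partition. The quantities are made up, and the loop mimics a window ordered by descending quantity with a frame from the start of the partition to the current row.

```python
# Hypothetical quantities for one (CustomerId, date) partition.
quantities = [12, 36, 12, 24, 6, 12]
ordered = sorted(quantities, reverse=True)   # orderBy(desc("Quantity"))

rows = []
rank = dense = 0
previous = None
running_max = None
for position, quantity in enumerate(ordered, start=1):
    if position == 1 or quantity != previous:
        rank = position      # rank jumps to the row position, leaving gaps after ties
        dense += 1           # dense_rank advances by exactly one
        previous = quantity
    # Frame "unbounded preceding to current row": max over everything seen so far.
    running_max = quantity if running_max is None else max(running_max, quantity)
    rows.append((quantity, rank, dense, running_max))

for row in rows:
    print(row)   # (Quantity, rank, dense_rank, running max)
```

After the three tied rows with quantity 12, `rank` continues at 6 while `dense_rank` continues at 4. Because the rows are sorted in descending order, the running maximum is always the first value in the partition.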

In [38]:
from pyspark.sql.functions import col

dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
        .select(col("CustomerId"), col("date"), col("Quantity"),
                purchaseRank.alias("quantityRank"), purchaseDenseRank.alias("quantityDenseRank"), 
               maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()

+----------+----------+--------+------------+-----------------+-------------------+
|CustomerId|      date|Quantity|quantityRank|quantityDenseRank|maxPurchaseQuantity|
+----------+----------+--------+------------+-----------------+-------------------+
|     12346|2011-01-18|   74215|           1|                1|              74215|
|     12346|2011-01-18|  -74215|           2|                2|              74215|
|     12347|2010-12-07|      36|           1|                1|                 36|
|     12347|2010-12-07|      30|           2|                2|                 36|
|     12347|2010-12-07|      24|           3|                3|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|                 36|
|     12347|2010-12-07|      12|           4|                4|             

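## 7.4 > Grouping sets

- Grouping sets use null to mark the aggregation level, so null values in the data must be removed first; otherwise the results are ambiguous
- This applies equally to grouping sets (SQL), cubes, and rollups
- The GROUPING SETS operator itself is available only in SQL; with DataFrames, use rollup and cube

#### 1. Rollup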

In [45]:
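# NOTE: drop() with no arguments removes nothing; dfWithDate.na.drop() is the call that drops rows containing nulls.
# The outputs below were produced with drop() as written (the grand total equals the full-data sum of Quantity).
dfNoNull = dfWithDate.drop()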
dfNoNull.createOrReplaceTempView("dfNoNull")

In [46]:
rolledUpDF = dfNoNull.rollup("Date", "Country").agg(sum("Quantity"))\
                    .selectExpr("Date", "Country", "`sum(Quantity)` as total_quantity")\
                    .orderBy("Date")
rolledUpDF.show()
# null in Date : grand total across all dates; null in Country only : subtotal for that date

+----------+--------------+--------------+
|      Date|       Country|total_quantity|
+----------+--------------+--------------+
|      null|          null|       5176450|
|2010-12-01|     Australia|           107|
|2010-12-01|       Germany|           117|
|2010-12-01|        France|           449|
|2010-12-01|          EIRE|           243|
|2010-12-01|   Netherlands|            97|
|2010-12-01|          null|         26814|
|2010-12-01|United Kingdom|         23949|
|2010-12-01|        Norway|          1852|
|2010-12-02|       Germany|           146|
|2010-12-02|          null|         21023|
|2010-12-02|          EIRE|             4|
|2010-12-02|United Kingdom|         20873|
|2010-12-03|       Belgium|           528|
|2010-12-03|        Poland|           140|
|2010-12-03|       Germany|           170|
|2010-12-03|         Spain|           400|
|2010-12-03|        France|           239|
|2010-12-03|          null|         14830|
|2010-12-03|   Switzerland|           110|
+----------+--------------+--------------+

#### 2. Cube

In [49]:
from pyspark.sql.functions import sum

dfNoNull.cube("Date", "Country").agg(sum(col("Quantity")))\
    .select("Date", "Country", "sum(Quantity)").orderBy("Date").show()

+----+--------------------+-------------+
|Date|             Country|sum(Quantity)|
+----+--------------------+-------------+
|null|               Japan|        25218|
|null|           Australia|        83653|
|null|            Portugal|        16180|
|null|             Germany|       117448|
|null|                null|      5176450|
|null|             Finland|        10666|
|null|              Cyprus|         6317|
|null|             Lebanon|          386|
|null|                 USA|         1034|
|null|                 RSA|          352|
|null|           Singapore|         5234|
|null|           Hong Kong|         4769|
|null|United Arab Emirates|          982|
|null|  European Community|          497|
|null|               Spain|        26824|
|null|         Unspecified|         3300|
|null|     Channel Islands|         9479|
|null|             Denmark|         8188|
|null|              Norway|        19247|
|null|      Czech Republic|          592|
+----+--------------------+-------------+
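Rollup and cube differ in which combinations of the grouping columns they aggregate over. The following plain-Python sketch, with made-up rows and hypothetical helper names (`rollup_sets`, `cube_sets`, `aggregate`), enumerates those combinations and uses `None` as the marker for "aggregated over this column", the role that null plays in the outputs above.

```python
from itertools import combinations

def rollup_sets(columns):
    """Hierarchical prefixes: (a, b), (a,), ()."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]

def cube_sets(columns):
    """Every combination of columns: (a, b), (a,), (b,), ()."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def aggregate(rows, columns, grouping_sets):
    """Sum Quantity for each grouping set; None marks a rolled-up column."""
    totals = {}
    for kept in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in kept else None for c in columns)
            totals[key] = totals.get(key, 0) + row["Quantity"]
    return totals

# Hypothetical rows for illustration.
rows = [
    {"Date": "2010-12-01", "Country": "France", "Quantity": 10},
    {"Date": "2010-12-01", "Country": "Germany", "Quantity": 5},
    {"Date": "2010-12-02", "Country": "France", "Quantity": 7},
]
columns = ["Date", "Country"]

rolled = aggregate(rows, columns, rollup_sets(columns))
cubed = aggregate(rows, columns, cube_sets(columns))
print(rolled[(None, None)], cubed[(None, "France")])
```

Only the cube contains per-country totals across all dates, such as `(None, "France")`. The sketch also shows why real nulls are a problem: a row whose Country is genuinely missing would produce the same key as the per-date subtotal.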

#### 3. Grouping metadata

In [52]:
from pyspark.sql.functions import grouping_id, sum, expr

# In Python, desc is a method and must be called as desc(); a bare .desc is Scala syntax
dfNoNull.cube("customerId", "StockCode").agg(grouping_id(), sum("Quantity"))\
    .orderBy(col("grouping_id()").desc()).show()
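The grouping ID encodes, as a bitmask, which grouping columns were aggregated away in a given result row. A plain-Python sketch of that encoding for the two columns used above (the helper is a local illustration written for this note, not the Spark function itself):

```python
def grouping_id(columns, kept):
    """One bit per grouping column, leftmost column as the highest bit.
    A bit is 1 when that column was aggregated over (shown as null)."""
    gid = 0
    for column in columns:
        gid = (gid << 1) | (0 if column in kept else 1)
    return gid

columns = ["customerId", "StockCode"]
levels = {
    kept: grouping_id(columns, kept)
    for kept in [("customerId", "StockCode"), ("customerId",), ("StockCode",), ()]
}

for kept, gid in levels.items():
    print(gid, kept)
```

Under this encoding, 3 is the grand total, 2 is the total per StockCode regardless of customer, 1 is the total per customer regardless of item, and 0 is each individual customerId and StockCode combination.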

#### 4. Pivot

In [55]:
pivoted = dfWithDate.groupBy("date").pivot("Country").sum()

# The date literal must be wrapped in single quotes. Written in backticks (`2011-12-05`), Spark treats it
# as a column name and fails with: AnalysisException: "cannot resolve '`2011-12-05`' given input columns: [...]"
# Pivoted columns are named <Country>_<aggregate>. The Spark version that raised that error listed the
# USA quantity column as USA_sum(CAST(Quantity AS BIGINT)), so the name selected below may need adjusting.
pivoted.where("date > '2011-12-05'").select("date", "`USA_sum(Quantity)`").show()
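Pivoting turns each distinct value of one column into its own output column, holding the aggregate for that value. A minimal plain-Python sketch with made-up rows. The `<Country>_sum(Quantity)` column naming is an assumption for illustration, since the actual names depend on the Spark version.

```python
from collections import defaultdict

# Hypothetical (date, Country, Quantity) rows.
rows = [
    ("2011-12-06", "USA", 12),
    ("2011-12-06", "France", 3),
    ("2011-12-06", "USA", -2),
    ("2011-12-07", "France", 8),
]

# One output column per distinct country (the pivot values).
countries = sorted({country for _, country, _ in rows})

# One output row per date; cells with no matching input stay None (null).
pivoted = defaultdict(lambda: {f"{c}_sum(Quantity)": None for c in countries})
for date, country, quantity in rows:
    column = f"{country}_sum(Quantity)"
    current = pivoted[date][column]
    pivoted[date][column] = quantity if current is None else current + quantity

# Filter on the date as a string literal, then select one pivoted column.
selected = {date: cells["USA_sum(Quantity)"]
            for date, cells in pivoted.items() if date > "2011-12-05"}
print(selected)
```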

## 7.5 > User-defined aggregation functions

So far, these are supported only in Java and Scala!