# Day 4, Session 5: Aggregation Operations

### Contents
* 1. Aggregate function examples
* 2. Group By examples
* 3. SparkContext vs. SparkSession

### References
* [PySpark Search](https://spark.apache.org/docs/latest/api/python/search.html)
* [PySpark Functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?#module-pyspark.sql.functions)

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .config("spark.sql.session.timeZone", "Asia/Seoul") \
    .getOrCreate()

In [2]:
""" Purchase history data """
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/retail-data/all") \
    .coalesce(5)
df.cache()
df.createOrReplaceTempView("dfTable")

In [3]:
df.show(5, truncate=False)
df.count()

+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate   |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |12/1/2010 8:26|2.55     |17850     |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |12/1/2010 8:26|2.75     |17850     |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+

541909

## 1. Aggregate Functions
### 1.1 Row counts (count, countDistinct, approx_count_distinct)

In [4]:
from pyspark.sql.functions import *
df.printSchema()
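df.selectExpr("count(*)").show() # 541,909 rows in total
df.where("Description is null").selectExpr("count(1)").show() # 1,454 rows with a null Description
# Naming a column explicitly excludes the rows where that column is null
df.selectExpr("count(Description)").show() # 540,455 (540,455 + 1,454 = 541,909)

# Both forms return the exact number of distinct StockCode values
df.select(countDistinct("StockCode")).show()
df.selectExpr("count(distinct StockCode)").show()

# Approximate distinct count: faster, at the cost of accuracy
df.select(approx_count_distinct("StockCode", 0.1)).show() # 0.1 is the maximum allowed relative standard deviation (rsd)
df.select(approx_count_distinct("StockCode", 0.01)).show()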

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)

+--------+
|count(1)|
+--------+
|  541909|
+--------+

+--------+
|count(1)|
+--------+
|    1454|
+--------+

+------------------+
|count(Description)|
+------------------+
|            540455|
+------------------+

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+

+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+
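
The null-handling rules behind these counts can be mimicked in plain Python. The sketch below is only an illustration and is meant for a separate Python session, not this Spark notebook: the `descriptions` list is hypothetical sample data, `None` stands in for SQL null, and the three expressions mirror `count(*)`, `count(Description)`, and `count(distinct Description)`.

```python
# Hypothetical sample column; None plays the role of SQL null.
descriptions = ["LANTERN", None, "HOLDER", "LANTERN", None]

# count(*): every row is counted, nulls included.
count_star = len(descriptions)

# count(Description): only rows with a non-null value are counted.
count_col = len([d for d in descriptions if d is not None])

# count(distinct Description): distinct non-null values.
count_distinct = len({d for d in descriptions if d is not None})

# Rows with a null value are the gap between the two counts.
null_rows = count_star - count_col

print(count_star, count_col, count_distinct, null_rows)
```

The same relationship holds in the Spark output above: the total row count equals the non-null `Description` count plus the number of null rows.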

### 1.2 Numeric aggregate functions (first, last, min, max, sum, sumDistinct, avg)

In [5]:
from pyspark.sql.functions import *
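# first/last depend on row order; pass ignorenulls=True to skip null values
df.select(first("StockCode"), last("StockCode")).show(1)

df.select(min("Quantity"), max("Quantity")).show(1)
df.select(min("Description"), max("Description")).show(1) # strings are compared lexicographically

df.select(sum("Quantity")).show(1)
df.select(sumDistinct("Quantity")).show(1) # adds each distinct value once; Spark 3.2+ prefers sum_distinct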

+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
|                 85123A|                 22138|
+-----------------------+----------------------+

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+

+--------------------+-----------------+
|    min(Description)| max(Description)|
+--------------------+-----------------+
| 4 PURPLE FLOCK D...|wrongly sold sets|
+--------------------+-----------------+

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+

+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 29310|
+----------------------+
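
To make the difference between `sum` and `sumDistinct`, and the effect of ignoring nulls in `first`/`last`, more concrete, here is a small plain-Python sketch. It uses hypothetical sample values rather than the retail data, treats `None` as SQL null, and should be run outside this notebook, because the wildcard import above replaces Python's built-in `sum`, `min`, and `max` with Spark column functions.

```python
# Hypothetical sample column; None plays the role of SQL null.
quantities = [None, 6, 6, 8, 6, 2, None]

# Aggregates such as sum skip null values.
non_null = [q for q in quantities if q is not None]

total = sum(non_null)                # like sum(Quantity)
distinct_total = sum(set(non_null))  # like sum(DISTINCT Quantity): each value once

first_any = quantities[0]            # like first(col): the first row, even if null
first_non_null = non_null[0]         # like first(col, ignorenulls=True)
last_non_null = non_null[-1]         # like last(col, ignorenulls=True)

print(total, distinct_total, first_any, first_non_null, last_non_null)
```

In Spark the row order of a DataFrame is not guaranteed unless it is sorted, so `first` and `last` are best treated as "some value from the group" unless an ordering has been imposed.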



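### 1.3 Statistical aggregate functions (avg, mean, variance, stddev)
* Sample variance and standard deviation: variance, stddev (aliases of var_samp, stddev_samp)
* Population variance and standard deviation: var_pop, stddev_pop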
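
The sample and population variants differ only in the divisor: the sample versions divide the sum of squared deviations by n - 1, the population versions by n. The following standalone sketch (plain Python, meant to run outside the Spark notebook) works this out by hand and cross-checks it against the `statistics` module. The five values are the `Quantity` column of the rows previewed by `df.show(5)`; the variable names are illustrative.

```python
import math
import statistics

# Quantity values of the five rows previewed earlier in this notebook.
quantities = [6, 6, 8, 6, 6]

n = len(quantities)
mean = math.fsum(quantities) / n
squared_dev = math.fsum((q - mean) ** 2 for q in quantities)

var_samp = squared_dev / (n - 1)   # sample variance: divide by n - 1
var_pop = squared_dev / n          # population variance: divide by n
stddev_samp = math.sqrt(var_samp)
stddev_pop = math.sqrt(var_pop)

# Cross-check each hand-computed value against the standard library.
print(var_samp, statistics.variance(quantities))
print(var_pop, statistics.pvariance(quantities))
print(stddev_samp, statistics.stdev(quantities))
print(stddev_pop, statistics.pstdev(quantities))
```

On a small sample the two divisors give visibly different results. On the full dataset of 541,909 rows the factor (n - 1)/n is so close to 1 that the sample and population variances of `Quantity` differ by less than 0.1.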

In [6]:
from pyspark.sql.functions import *

df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_transactions"),
).selectExpr(
    "total_purchases / total_transactions",
    "avg_purchases",
    "mean_transactions").show(3)

# variance and stddev are aliases of var_samp and stddev_samp
df.select(variance("Quantity"), stddev("Quantity"),
          var_samp("Quantity"), stddev_samp("Quantity"),
          var_pop("Quantity"), stddev_pop("Quantity")).show()


+--------------------------------------+----------------+-----------------+
|(total_purchases / total_transactions)|   avg_purchases|mean_transactions|
+--------------------------------------+----------------+-----------------+
|                      9.55224954743324|9.55224954743324| 9.55224954743324|
+--------------------------------------+----------------+-----------------+

+------------------+---------------------+------------------+---------------------+------------------+--------------------+
|var_samp(Quantity)|stddev_samp(Quantity)|var_samp(Quantity)|stddev_samp(Quantity)| var_pop(Quantity)|stddev_pop(Quantity)|
+------------------+---------------------+------------------+---------------------+------------------+--------------------+
|47559.391409298696|   218.08115785023404|47559.391409298696|   218.08115785023404|47559.303646609005|  218.08095663447784|
+------------------+---------------------+------------------+---------------------+------------------+--------------------+

## 2. Grouping (Group By)

### 2.1 Grouping with expressions

In [7]:
from pyspark.sql.functions import count, expr
df.printSchema()
df.groupBy("InvoiceNo", "CustomerId").agg(expr("count(Quantity) as CountOfQuantity")).show(5)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)

+---------+----------+---------------+
|InvoiceNo|CustomerId|CountOfQuantity|
+---------+----------+---------------+
|   536846|     14573|             76|
|   537026|     12395|             12|
|   537883|     14437|              5|
|   538068|     17978|             12|
|   538279|     14952|              7|
+---------+----------+---------------+
only showing top 5 rows



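### 2.2 Grouping with a map
> A Python dictionary can describe the aggregations: each key is a column name and each value is the name of the aggregate function to apply to it.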

In [8]:
df.groupBy("InvoiceNo").agg( { "Quantity" : "avg", "UnitPrice" : "stddev_pop" } ).show(5)

+---------+---------------------+------------------+
|InvoiceNo|stddev_pop(UnitPrice)|     avg(Quantity)|
+---------+---------------------+------------------+
|   536596|    6.618375094302897|               1.5|
|   536938|   2.4313249096586267|33.142857142857146|
|   537252|                  0.0|              31.0|
|   537691|    2.761232695735729|              8.15|
|   538041|                  0.0|              30.0|
+---------+---------------------+------------------+
only showing top 5 rows
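
To show what the dictionary form expresses, the sketch below re-creates the idea in plain Python, outside Spark. Everything in it is hypothetical and for illustration only: the `group_agg` helper, the `AGG_FUNCS` lookup table, and the sample rows are not part of PySpark or of the retail dataset.

```python
from collections import defaultdict
import statistics

# Hypothetical lookup from aggregate name to a plain-Python implementation.
AGG_FUNCS = {
    "avg": statistics.fmean,
    "stddev_pop": statistics.pstdev,
}

def group_agg(rows, key, spec):
    """Group dict rows by `key`, then apply the aggregate named in `spec` to each column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = {}
    for group_key, members in groups.items():
        result[group_key] = {
            f"{func}({column})": AGG_FUNCS[func]([m[column] for m in members])
            for column, func in spec.items()
        }
    return result

# Hypothetical sample rows, not taken from the retail dataset.
rows = [
    {"InvoiceNo": "A1", "Quantity": 2, "UnitPrice": 1.0},
    {"InvoiceNo": "A1", "Quantity": 4, "UnitPrice": 3.0},
    {"InvoiceNo": "B2", "Quantity": 30, "UnitPrice": 5.0},
]

summary = group_agg(rows, "InvoiceNo", {"Quantity": "avg", "UnitPrice": "stddev_pop"})
for invoice, stats in summary.items():
    print(invoice, stats)
```

Because a dictionary holds each key only once, this form allows a single aggregate per column. When a column needs several aggregates, the expression form from section 2.1 is the better fit.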



### Exercise #9: Read data/tbl_purchase.csv and find the minimum and maximum purchase amounts
> selectExpr, min, max

In [11]:
purchase = spark.read.option("header", "true").option("inferSchema", "true").csv("data/tbl_purchase.csv")
purchase.selectExpr("min(p_amount)", "max(p_amount)").show()

+-------------+-------------+
|min(p_amount)|max(p_amount)|
+-------------+-------------+
|      1000000|      4500000|
+-------------+-------------+



### Exercise #10: Read data/tbl_purchase.csv and print the total purchase amount (p_amount) per user (p_uid)
> groupBy, sum

In [12]:
user = spark.read.option("header", "true").option("inferSchema", "true").csv("data/tbl_user.csv")

purchase.groupBy("p_uid").sum("p_amount").show()

+-----+-------------+
|p_uid|sum(p_amount)|
+-----+-------------+
|    1|      3800000|
|    3|      1000000|
|    5|      6000000|
|    4|      4500000|
|    2|      1400000|
+-----+-------------+

