# Day 3, Session 3: Data Types
> Hands-on practice with the data types used in Spark.

### References
* [PySpark Search](https://spark.apache.org/docs/latest/api/python/search.html)
* [Pyspark Functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?#module-pyspark.sql.functions)

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .config("spark.sql.session.timeZone", "Asia/Seoul") \
    .getOrCreate()

In [2]:
""" Create the DataFrame """
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/retail-data/by-day/2010-12-01.csv")
df.printSchema()
df.createOrReplaceTempView("retail")
df.show(5)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   8

## 1. Literal Types

In [3]:
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

## 2. Working with Boolean Data Types
### 2.1 AND Conditions

In [4]:
from pyspark.sql.functions import col

x1 = df.where(col("InvoiceNO") != 536365).select("InvoiceNO", "Description")
x2 = df.where("InvoiceNO <> 536365").select("InvoiceNO", "Description")
x3 = df.where("InvoiceNO = 536365").select("InvoiceNO", "Description")

x1.show(2)
x2.show(2)

+---------+--------------------+
|InvoiceNO|         Description|
+---------+--------------------+
|   536366|HAND WARMER UNION...|
|   536366|HAND WARMER RED P...|
+---------+--------------------+
only showing top 2 rows

+---------+--------------------+
|InvoiceNO|         Description|
+---------+--------------------+
|   536366|HAND WARMER UNION...|
|   536366|HAND WARMER RED P...|
+---------+--------------------+
only showing top 2 rows



### 2.2 OR Conditions

In [5]:
from pyspark.sql.functions import instr
df.where("UnitPrice > 600 OR instr(Description, 'POSTAGE') >= 1").show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536370|     POST|       POSTAGE|       3|2010-12-01 08:45:00|     18.0|   12583.0|        France|
|   536403|     POST|       POSTAGE|       1|2010-12-01 11:27:00|     15.0|   12791.0|   Netherlands|
|   536527|     POST|       POSTAGE|       1|2010-12-01 13:04:00|     18.0|   12662.0|       Germany|
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



### 2.3 ISIN: Whether a Value Is in a Given List

In [6]:
# Use the IN clause through a Spark SQL expression
from pyspark.sql.functions import desc
df.select('StockCode').where("StockCode in ('DOT', 'POST', 'C2')").distinct().show()

+---------+
|StockCode|
+---------+
|      DOT|
|     POST|
|       C2|
+---------+



### 2.4 INSTR: Whether a String Contains a Given Substring

In [7]:
from pyspark.sql.functions import *
""" instr function """
df.withColumn("added", instr(df.Description, "POSTAGE")).where("added > 1").show() # keeps rows where 'POSTAGE' starts after position 1; here it starts at the 8th character

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+-----+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|added|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+-----+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|    8|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|    8|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+-----+
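
The `added` column above holds a 1-based position, with 0 meaning "not found". The following is a minimal plain-Python sketch of that convention, not Spark's implementation; the helper name `instr` is reused here only for illustration, and null inputs are not modelled.

```python
def instr(text, substring):
    """Return the 1-based position of substring in text, or 0 if absent."""
    # str.find is 0-based and returns -1 when nothing is found,
    # so adding 1 yields the 1-based / 0-if-missing convention.
    return text.find(substring) + 1


print(instr("DOTCOM POSTAGE", "POSTAGE"))       # 8
print(instr("POSTAGE", "POSTAGE"))              # 1
print(instr("WHITE METAL LANTERN", "POSTAGE"))  # 0
```

This is why the filter `>= 1` in section 2.2 keeps both the POSTAGE and DOTCOM POSTAGE rows, while `> 1` here keeps only the rows where the match starts later in the string.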



## 3. Working with Numeric Data Types
### 3.1 Writing Functions as Expressions

In [8]:
from pyspark.sql.functions import expr, pow
df.selectExpr("CustomerID", "pow(Quantity * UnitPrice, 2) + 5 as realQuantity").show(2)

+----------+------------------+
|CustomerID|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows
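
The first value above is not exactly 239.09 because the arithmetic is done in double-precision floating point. As a rough sketch, the same expression can be reproduced in plain Python; `real_quantity` is a hypothetical helper name, and the inputs are the Quantity and UnitPrice of the first two rows shown earlier.

```python
def real_quantity(quantity, unit_price):
    # Same formula as the selectExpr above: pow(Quantity * UnitPrice, 2) + 5
    return (quantity * unit_price) ** 2 + 5


first = real_quantity(6, 2.55)
second = real_quantity(6, 3.39)

# Raw doubles may carry a tiny representation error.
print(first, second)

# Rounding for display hides that error.
print(round(first, 2))   # 239.09
print(round(second, 4))  # 418.7156
```

If exact decimal results matter, rounding the result or using a decimal type is usually safer than comparing doubles for equality.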



### 3.2 Rounding (round), Ceiling (ceil), and Floor (floor)

In [9]:
from pyspark.sql.functions import *
df.selectExpr("round(2.5, 0)", "ceil(2.4)", "floor(2.6)").show(1)

+-------------+---------+----------+
|round(2.5, 0)|CEIL(2.4)|FLOOR(2.6)|
+-------------+---------+----------+
|            3|        3|         2|
+-------------+---------+----------+
only showing top 1 row
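
Note that `round(2.5, 0)` returned 3 above: Spark SQL's `round` rounds halves up, whereas Python's built-in `round` rounds halves to the nearest even number. The sketch below models the half-up rule with the standard `decimal` module; `round_half_up` is a hypothetical helper, not a Spark function.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=0):
    # Quantum is 1, 0.1, 0.01, ... depending on the number of places.
    quantum = Decimal(10) ** -places
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


print(round_half_up(2.5))  # 3  (half-up, like the output above)
print(round(2.5))          # 2  (Python rounds half to even)
print(math.ceil(2.4))      # 3
print(math.floor(2.6))     # 2
```

Spark also offers `bround` for half-even rounding when that behaviour is what you need.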



### 3.3 Summary Statistics

In [10]:
df.describe().show()
df.describe("InvoiceNo").show() # pass a column name to summarize only that column

+-------+-----------------+------------------+--------------------+------------------+-------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|        InvoiceDate|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+-------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|               3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128|               null| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|               null|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             

+ More statistical functions are available in the StatFunctions package (reachable through df.stat in PySpark)

## 4. Working with String Data Types
### 4.1 Capitalizing the First Letter of Each Word
* initcap capitalizes the first letter of every whitespace-separated word

In [11]:
from pyspark.sql.functions import initcap
df.select(initcap(col("Description"))).show(2, False)

+----------------------------------+
|initcap(Description)              |
+----------------------------------+
|White Hanging Heart T-light Holder|
|White Metal Lantern               |
+----------------------------------+
only showing top 2 rows
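
The output above shows "T-light" rather than "T-Light": only the character after a space is capitalized, and the rest of each word is lowercased. A minimal plain-Python sketch of that rule follows; the function is an illustrative model, not Spark's code, and it only treats the space character as a separator.

```python
def initcap(text):
    # Capitalize the first character of each space-separated word
    # and lowercase the remainder of the word.
    return " ".join(word[:1].upper() + word[1:].lower() for word in text.split(" "))


description = "WHITE HANGING HEART T-LIGHT HOLDER"
print(initcap(description))  # White Hanging Heart T-light Holder
print(description.title())   # White Hanging Heart T-Light Holder
```

Python's own `str.title()` also capitalizes after the hyphen, so it is not a drop-in equivalent.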



### 4.2 Uppercase (upper) and Lowercase (lower)

In [12]:
from pyspark.sql.functions import lower, upper
df.selectExpr("Description", "lower(Description)", "upper(Description)").show(2)

+--------------------+--------------------+--------------------+
|         Description|  lower(Description)|  upper(Description)|
+--------------------+--------------------+--------------------+
|WHITE HANGING HEA...|white hanging hea...|WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern| WHITE METAL LANTERN|
+--------------------+--------------------+--------------------+
only showing top 2 rows



### 4.3 Trimming and Padding Strings: lpad/ltrim/rpad/rtrim/trim

In [13]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(
    ltrim(lit("   HELLO   ")).alias("ltrim"),
    rtrim(lit("   HELLO   ")).alias("rtrim"),
    trim(lit("   HELLO   ")).alias("trim"),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")
).show(2)

+--------+--------+-----+---+----------+
|   ltrim|   rtrim| trim| lp|        rp|
+--------+--------+-----+---+----------+
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
|HELLO   |   HELLO|HELLO|HEL|HELLO     |
+--------+--------+-----+---+----------+
only showing top 2 rows
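
In the output above, `lpad("HELLO", 3, " ")` produced "HEL": when the target length is shorter than the string, padding functions truncate instead of pad. The sketch below models that behaviour in plain Python; the helpers are illustrative only and assume a non-empty pad string.

```python
def lpad(text, length, pad):
    # Truncate when the text is already long enough.
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad(text, length, pad):
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


print(repr(lpad("HELLO", 3, " ")))   # 'HEL'
print(repr(rpad("HELLO", 10, " ")))  # 'HELLO     '
print(repr(lpad("7", 3, "0")))       # '007'
print(repr("   HELLO   ".lstrip()))  # 'HELLO   '
```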



## 5. Regular Expressions
### 5.1 Replacing Words with regexp_replace

In [14]:
from pyspark.sql.functions import regexp_replace
regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
df.select(regexp_replace(col("Description"), regex_string, "COLOR").alias("color_clean"), col("Description")).show(2, truncate=False)

+----------------------------------+----------------------------------+
|color_clean                       |Description                       |
+----------------------------------+----------------------------------+
|COLOR HANGING HEART T-LIGHT HOLDER|WHITE HANGING HEART T-LIGHT HOLDER|
|COLOR METAL LANTERN               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+
only showing top 2 rows
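
An alternation pattern like the one above matches anywhere in the string, including inside longer words. The sketch below uses Python's `re` module to illustrate this; Spark uses Java regular expressions, but alternation and `\b` behave the same way for this simple pattern. The description "RED COLOURED BOX" is a made-up example, not a row from the dataset.

```python
import re

colors = "BLACK|WHITE|RED|GREEN|BLUE"

# Plain alternation: every occurrence is replaced, even inside a word.
print(re.sub(colors, "COLOR", "WHITE HANGING HEART T-LIGHT HOLDER"))
# COLOR HANGING HEART T-LIGHT HOLDER
print(re.sub(colors, "COLOR", "RED COLOURED BOX"))
# COLOR COLOUCOLOR BOX

# Word boundaries restrict the match to whole words.
whole_words = r"\b(?:" + colors + r")\b"
print(re.sub(whole_words, "COLOR", "RED COLOURED BOX"))
# COLOR COLOURED BOX
```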



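## 6. Working with Date and Timestamp Data Types
> If you need to set the time zone, use the Spark SQL configuration property spark.sql.session.timeZone, as was done when the session was created in the first cell <br>
> TimestampType stores values with microsecond precision, as the output below shows; if you need finer precision than that, a workaround is to convert the data to a long and process it in that form <br>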

### 6.1 Getting Today's Date

In [15]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(10) \
    .withColumn("today", current_date()) \
    .withColumn("now", current_timestamp())

dateDF.createOrReplaceTempView("dataTable")
dateDF.printSchema()

dateDF.show(3, False)

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)

+---+----------+--------------------------+
|id |today     |now                       |
+---+----------+--------------------------+
|0  |2020-12-10|2020-12-10 00:33:35.180493|
|1  |2020-12-10|2020-12-10 00:33:35.180493|
|2  |2020-12-10|2020-12-10 00:33:35.180493|
+---+----------+--------------------------+
only showing top 3 rows



### 6.2 Adding and Subtracting Days

In [16]:
from pyspark.sql.functions import date_sub, date_add
dateDF.select(
    date_sub(col("today"), 5),
    date_add(col("today"), 5)
).show(1)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2020-12-05|        2020-12-15|
+------------------+------------------+
only showing top 1 row
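
Both functions shift a date by a whole number of days. As a quick sanity check, the same arithmetic can be sketched with Python's `datetime` module; the helper names mirror the Spark functions only for readability, and the starting date is the one shown in the output above.

```python
from datetime import date, timedelta


def date_add(start, days):
    return start + timedelta(days=days)


def date_sub(start, days):
    return start - timedelta(days=days)


today = date(2020, 12, 10)
print(date_sub(today, 5))  # 2020-12-05
print(date_add(today, 5))  # 2020-12-15

# Month and year boundaries are handled by the calendar arithmetic.
print(date_add(date(2020, 12, 29), 5))  # 2021-01-03
```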



### 6.3 Converting Strings to Dates

In [17]:
from pyspark.sql.functions import to_date, lit

spark.range(5) \
    .withColumn("date", lit("2017-01-01")) \
    .select(to_date(col("date"))) \
    .show(1)

+---------------+
|to_date(`date`)|
+---------------+
|     2017-01-01|
+---------------+
only showing top 1 row



In [18]:
""" A case where a parsing error makes the date come back as null """
dateDF.select(to_date(lit("2016-20-12")), to_date(lit("2017-12-11"))).show(1) # month and day are swapped, and 20 is not a valid month

+---------------------+---------------------+
|to_date('2016-20-12')|to_date('2017-12-11')|
+---------------------+---------------------+
|                 null|           2017-12-11|
+---------------------+---------------------+
only showing top 1 row
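
The null above is the easy case, because the failure is visible. The harder case is a swapped month and day that still forms a valid date and is parsed silently. The sketch below models both cases with Python's `datetime`; `to_date_or_none` is a hypothetical helper, and Python format codes (`%Y-%m-%d`) differ from Spark's pattern letters.

```python
from datetime import date, datetime


def to_date_or_none(text, fmt="%Y-%m-%d"):
    # Return None instead of raising, similar in spirit to getting null back.
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


print(to_date_or_none("2016-20-12"))  # None (20 is not a valid month)
print(to_date_or_none("2017-12-11"))  # 2017-12-11

# If the source really meant year-day-month, an explicit format is needed;
# otherwise the value parses without error but means a different day.
print(to_date_or_none("2017-12-11", "%Y-%d-%m"))  # 2017-11-12
```

Stating the expected format explicitly, and checking for unexpected nulls after conversion, helps catch both kinds of problem.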



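## 7. Working with Null Values
+ Handling null values explicitly is always better than relying on them implicitly
+ Declaring a column as non-nullable is not enforced
+ The nullable property mainly helps the Spark SQL optimizer simplify how it handles that column
+ There are two ways to handle null values
    + Drop nulls explicitly
    + Fill nulls with a specific value, either globally or per column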

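### 7.1 Null-Handling Functions Based on Column Values (ifnull, nullif, nvl, nvl2)
+ These are SQL functions and can also be used in DataFrame select expressions
    + ifnull(null, 'return_value') # returns the second value if the first is null, otherwise the first
    + nullif('value', 'value')     # returns null if the two values are equal, otherwise the first value
    + nvl(null, 'return_value')    # returns the second value if the first is null, otherwise the first
    + nvl2('not_null', 'return_value', 'else_value') # returns the second value if the first is not null, otherwise the third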

In [19]:
spark.sql("""
SELECT
    ifnull(null, 'return_value'),
    nullif('value', 'value'),
    nvl(null, 'return_value'),
    nvl2('not null', 'return_value', 'else_value')
""").show()

+----------------------------+------------------------+-------------------------+----------------------------------------------+
|ifnull(NULL, 'return_value')|nullif('value', 'value')|nvl(NULL, 'return_value')|nvl2('not null', 'return_value', 'else_value')|
+----------------------------+------------------------+-------------------------+----------------------------------------------+
|                return_value|                    null|             return_value|                                  return_value|
+----------------------------+------------------------+-------------------------+----------------------------------------------+
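
The four functions differ only in which argument they test and which one they return. A minimal plain-Python sketch of those rules follows, with `None` standing in for SQL null; these are illustrative models rather than Spark's implementations, and SQL's three-valued comparison of nulls is not modelled.

```python
def ifnull(value, default):
    # Second argument if the first is null, otherwise the first.
    return default if value is None else value


# nvl follows the same rule as ifnull.
nvl = ifnull


def nullif(first, second):
    # Null if both values are equal, otherwise the first value.
    return None if first == second else first


def nvl2(value, if_not_null, if_null):
    # Second argument if the first is NOT null, otherwise the third.
    return if_not_null if value is not None else if_null


print(ifnull(None, "return_value"))                     # return_value
print(nullif("value", "value"))                         # None
print(nvl(None, "return_value"))                        # return_value
print(nvl2("not null", "return_value", "else_value"))   # return_value
```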



### 7.2 Dropping Rows Based on Null Column Values (na.drop)

In [20]:
df.na.drop()
df.na.drop("any").show(1) # drop a row if any of its column values is null
df.na.drop("all").show(1) # drop a row only if all of its column values are null

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 1 row

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
+---

In [21]:
# Pass a list of columns as subset so that only those columns are checked for nulls
df.na.drop("all", subset=("StockCode", "InvoiceNo")).show(1)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 1 row
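
The difference between "any", "all", and `subset` is easier to see on a tiny dataset that actually contains nulls. The sketch below models the rule in plain Python over a list of dictionaries; `drop_nulls` and the three sample rows are hypothetical, chosen only to exercise each case.

```python
def drop_nulls(rows, how="any", subset=None):
    kept = []
    for row in rows:
        # Only the subset columns are inspected when a subset is given.
        columns = subset if subset is not None else list(row)
        null_flags = [row[column] is None for column in columns]
        should_drop = any(null_flags) if how == "any" else all(null_flags)
        if not should_drop:
            kept.append(row)
    return kept


rows = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "CustomerID": 17850.0},
    {"InvoiceNo": "536544", "StockCode": "DOT", "CustomerID": None},
    {"InvoiceNo": None, "StockCode": None, "CustomerID": None},
]

print(len(drop_nulls(rows, "any")))  # 1
print(len(drop_nulls(rows, "all")))  # 2
print(len(drop_nulls(rows, "all", subset=["StockCode", "InvoiceNo"])))  # 2
```

With a subset, the second row survives "all" because its StockCode and InvoiceNo are present, even though CustomerID is null.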



### 7.3 Filling Null Column Values (na.fill)

In [22]:
""" Create a DataFrame that contains nulls """
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, DoubleType

myManualSchema = StructType([
    StructField("string_null", StringType(), True),
    StructField("string2_null", StringType(), True),
    StructField("number_null", DoubleType(), True)
])

myRows = []
myRows.append(Row("Hello", None, float(5))) # null in string2_null
myRows.append(Row(None, "World", None))     # null in string_null and number_null

myDf = spark.createDataFrame(myRows, myManualSchema)
myDf.show()

myDf.na.fill( {"number_null": 5.0, "string_null": "not_null"} ).show()

+-----------+------------+-----------+
|string_null|string2_null|number_null|
+-----------+------------+-----------+
|      Hello|        null|        5.0|
|       null|       World|       null|
+-----------+------------+-----------+

+-----------+------------+-----------+
|string_null|string2_null|number_null|
+-----------+------------+-----------+
|      Hello|        null|        5.0|
|   not_null|       World|        5.0|
+-----------+------------+-----------+
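
In the second table above, string2_null is still null because it was not listed in the dictionary passed to `na.fill`. The sketch below models that per-column rule in plain Python, using the same two rows; `fill_nulls` is a hypothetical helper, not part of PySpark.

```python
def fill_nulls(rows, replacements):
    filled = []
    for row in rows:
        filled.append({
            # Replace only nulls, and only in columns named in replacements.
            column: replacements[column]
            if value is None and column in replacements
            else value
            for column, value in row.items()
        })
    return filled


rows = [
    {"string_null": "Hello", "string2_null": None, "number_null": 5.0},
    {"string_null": None, "string2_null": "World", "number_null": None},
]

result = fill_nulls(rows, {"number_null": 5.0, "string_null": "not_null"})
for row in result:
    print(row)
```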



### Exercise #5: Read data/tbl_purchase.csv and print the product name (p_name) in lowercase
> Hint: lower()

In [23]:
purchase = spark.read.option("header", "true").option("inferSchema", "true").csv("data/tbl_purchase.csv")
purchase.selectExpr("lower(p_name)").show()

+-------------+
|lower(p_name)|
+-------------+
|      lg dios|
|      lg gram|
|      lg cyon|
|        lg tv|
|  lg computer|
|      lg gram|
|        lg tv|
+-------------+



### Exercise #6: Read data/tbl_user.csv and print it with an added column named "now" that shows the current time
> Hint: withColumn("column", function), current_timestamp()

In [24]:
from pyspark.sql.functions import *

user = spark.read.option("header", "true").option("inferSchema", "true").csv("data/tbl_user.csv")
user.withColumn("now", current_timestamp()).show(truncate=False)

+----+----------+--------+--------+--------------------------+
|u_id|u_name    |u_gender|u_signup|now                       |
+----+----------+--------+--------+--------------------------+
|1   |정휘센    |남      |19580808|2020-12-10 00:33:43.966966|
|2   |김싸이언  |남      |19590201|2020-12-10 00:33:43.966966|
|3   |박트롬    |여      |19951030|2020-12-10 00:33:43.966966|
|4   |청소기    |남      |19770329|2020-12-10 00:33:43.966966|
|5   |유코드제로|여      |20021029|2020-12-10 00:33:43.966966|
|6   |윤디오스  |남      |20040101|2020-12-10 00:33:43.966966|
|7   |임모바일  |남      |20040807|2020-12-10 00:33:43.966966|
|8   |조노트북  |여      |20161201|2020-12-10 00:33:43.966966|
|9   |최컴퓨터  |남      |20201124|2020-12-10 00:33:43.966966|
+----+----------+--------+--------+--------------------------+

