# Chapter 6. Working with Different Types of Data
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling nulls
- Complex types
- User-defined functions

In [2]:
# Create the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .master("local") \
    .appName("DataType") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()


In [9]:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("../../data/retail-data-2010-12-01.csv")
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



In [None]:
df.createOrReplaceTempView("dfTable")

**6.2 Converting to Spark Types**
- `lit` function -> converts a value from the host language (here, Python) into the corresponding Spark type

In [11]:
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

**6.3 Working with Booleans**

In [15]:
from pyspark.sql.functions import col

df.where(col("InvoiceNo") != 536365).select("InvoiceNo", "Description").show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



In [18]:
# The same kind of condition can be written more concisely as a SQL expression string passed to where
df.where("InvoiceNo = 536365").show(5, False)

+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |2010-12-01 08:26:00|2.55     |17850.0   |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |2010-12-01 08:26:00|2.75     |17850.0   |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
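+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+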

In [29]:
from pyspark.sql.functions import instr

price_filter = col("UnitPrice") > 600
description = instr(df.Description, "POSTAGE") >= 1
df.where(df.StockCode.isin("DOT")).where(price_filter & description).show()


+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
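
The `instr(...) >= 1` filter above works because `instr` returns a 1-based position and returns 0 when the substring is absent. The following is a minimal plain-Python sketch of that convention, not Spark's implementation: the helper `instr_like` is a hypothetical name, and it ignores Spark's null handling.

```python
def instr_like(text: str, substring: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    # str.find gives a 0-based index, or -1 when nothing matches,
    # so adding 1 yields the "1-based, 0 means missing" convention.
    return text.find(substring) + 1


# Two descriptions taken from the dataset shown in this notebook.
descriptions = ["DOTCOM POSTAGE", "WHITE METAL LANTERN"]

# Same shape as the filter above: keep rows whose position is >= 1.
matches = [d for d in descriptions if instr_like(d, "POSTAGE") >= 1]
print(matches)
```

Under this convention, `>= 1` reads as "contains the substring", which is why it can be combined with the price filter using `&`.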



In [30]:
from pyspark.sql.functions import expr
df.withColumn("isExpensive", expr("NOT UnitPrice <= 250")).where("isExpensive").select("description", "UnitPrice").show(5)

+--------------+---------+
|   description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



**6.4 Working with Numbers**
- `expr, pow, round, lit, bround, describe`

In [36]:
# Note: this import shadows Python's built-in pow and round for the rest of the notebook
from pyspark.sql.functions import col, expr, pow, round, lit, bround

fabricated_quantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
# show must be called with parentheses; a bare .show only returns the bound method object
df.select(expr("CustomerID"), fabricated_quantity.alias("real_quantity")).show(2)

In [34]:
df.select(round(col("UnitPrice") , 1).alias("rounded"), col("UnitPrice")).show(5)

+-------+---------+
|rounded|UnitPrice|
+-------+---------+
|    2.6|     2.55|
|    3.4|     3.39|
|    2.8|     2.75|
|    3.4|     3.39|
|    3.4|     3.39|
+-------+---------+
only showing top 5 rows



In [37]:
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
|          3.0|           2.0|
+-------------+--------------+
only showing top 2 rows
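
The output above differs because `round` rounds ties away from zero (half up) while `bround` rounds ties to the nearest even number (half even). The sketch below reproduces the two tie-breaking rules with Python's standard `decimal` module. It illustrates the rounding modes only and is not Spark code; `half_up` and `half_even` are hypothetical helper names.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def half_up(value: str) -> Decimal:
    """Round to an integer, sending ties away from zero (like round)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def half_even(value: str) -> Decimal:
    """Round to an integer, sending ties to the nearest even number (like bround)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


# Only exact ties (x.5) are affected; other values round the same way.
for value in ["1.5", "2.5", "3.5", "2.6"]:
    print(value, half_up(value), half_even(value))
```

Half-even rounding avoids the upward bias that half-up rounding introduces when many tied values are summed, which is the usual reason to choose `bround`.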



In [39]:
df.describe().show()

+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                null| 8.627413127413128| 4.151946589446603|15661.388719512195|          null|
| stddev|72.89447869788873|17407.897548583845|                null|26.371821677029203|15.638659854603892|1854.4496996893627|          null|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

                                                                                

In [41]:
# monotonically_increasing_id -> adds a unique ID to every row
from pyspark.sql.functions import monotonically_increasing_id

df.select(monotonically_increasing_id()).show(3)

+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
+-----------------------------+
only showing top 3 rows
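
The IDs above are consecutive only because this small file is read as a single partition. The Spark documentation describes the generated IDs as unique and increasing but not consecutive, with the partition index in the upper bits and the per-partition record number in the lower 33 bits. The sketch below mimics that layout in plain Python under that assumption; `sketch_id` is a hypothetical helper, not a Spark API.

```python
def sketch_id(partition_index: int, row_in_partition: int) -> int:
    """Combine a partition index and a row counter into a single ID."""
    # Assumed layout: partition index shifted above the lower 33 bits,
    # record number within the partition stored in those lower 33 bits.
    return (partition_index << 33) + row_in_partition


# Two hypothetical partitions with three rows each.
ids = [sketch_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

The jump between the third and fourth ID shows why these values work as unique row identifiers but not as a gap-free row number.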



**Working with Strings**
- `initcap` : capitalizes the first letter of every word


In [43]:
from pyspark.sql.functions import initcap

df.select(initcap(col("description"))).show(3)

+--------------------+
|initcap(description)|
+--------------------+
|White Hanging Hea...|
| White Metal Lantern|
|Cream Cupid Heart...|
+--------------------+
only showing top 3 rows
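
As the output above suggests, `initcap` lowercases the rest of each word and treats words as separated by spaces. The sketch below approximates that behavior in plain Python and contrasts it with `str.title()`, which also capitalizes after an apostrophe. `initcap_like` is a hypothetical helper and only an approximation of the Spark function.

```python
def initcap_like(text: str) -> str:
    """Lowercase the text, then uppercase the first character of each space-separated word."""
    words = text.lower().split(" ")
    # word[:1] is safe for empty strings produced by consecutive spaces.
    return " ".join(word[:1].upper() + word[1:] for word in words)


# Descriptions taken from the dataset shown in this notebook.
for description in ["WHITE METAL LANTERN", "POPPY'S PLAYHOUSE BEDROOM"]:
    print(initcap_like(description), "|", description.title())
```

The second description is where the two approaches diverge: the space-based rule keeps the possessive `'s` in lowercase, while `str.title()` capitalizes the letter after the apostrophe.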



In [45]:
from pyspark.sql.functions import lower, upper

df.select(col("description"), lower(col("description")), upper(col("description"))).show(3)

+--------------------+--------------------+
|         description|  lower(description)|
+--------------------+--------------------+
|WHITE HANGING HEA...|white hanging hea...|
| WHITE METAL LANTERN| white metal lantern|
|CREAM CUPID HEART...|cream cupid heart...|
+--------------------+--------------------+
only showing top 3 rows

