# "[Spark] PySpark Feature Engineering Techniques"
> PySpark feature engineering techniques

- toc: true 
- badges: true
- comments: true
- categories: [Spark]
- tags: [spark, pyspark, feature, engineering]

In [1]:
import os
MINIO_ACCESS_KEY = os.environ['MINIO_ACCESS_KEY']
MINIO_SECRET_KEY = os.environ['MINIO_SECRET_KEY']

# `spark` is the SparkSession provided by the notebook or pyspark shell.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
s3a_settings = {
    "fs.s3a.access.key": MINIO_ACCESS_KEY,
    "fs.s3a.secret.key": MINIO_SECRET_KEY,
    "fs.s3a.endpoint": "http://lab101:10170",
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.connection.ssl.enabled": "false",
    "fs.s3a.path.style.access": "true",
    "com.amazonaws.services.s3.enableV2": "true",
    "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
}
# Apply every S3A setting so Spark can read from the MinIO endpoint.
for key, value in s3a_settings.items():
    hadoop_conf.set(key, value)

In [2]:
sales = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("s3a://data/retail-data/by-day/*.csv")\
    .where("Description IS NOT NULL")

In [3]:
fakeIntDF = spark.read.parquet("s3a://data/simple-ml-integers")
simpleDF = spark.read.json("s3a://data/simple-ml")
scaleDF = spark.read.parquet("s3a://data/simple-ml-scaling")

- The sales dataset will be accessed many times, so cache it

In [5]:
sales.cache()
sales.show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|   14075.0|United Kingdom|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|   14075.0|United Kingdom|
|   580538|    21544|SKULLS  WATER TRA...|      48|2011-12-05 08:38:00|     0.85|   14075.0|United Kingdom|
|   580538|    23126|FELTCRA

# 1. Transformers
- Functions that convert raw data in various ways
- For example, turning one variable into two other variables, or casting a variable to a type such as Double
- Example: Tokenizer

In [12]:
from pyspark.ml.feature import Tokenizer

tkn = Tokenizer().setInputCol("Description")
tkn.transform(sales.select("Description")).show(5)

+--------------------+------------------------------+
|         Description|Tokenizer_f8c5e492a92d__output|
+--------------------+------------------------------+
|  RABBIT NIGHT LIGHT|          [rabbit, night, l...|
| DOUGHNUT LIP GLOSS |          [doughnut, lip, g...|
|12 MESSAGE CARDS ...|          [12, message, car...|
|BLUE HARMONICA IN...|          [blue, harmonica,...|
|   GUMBALL COAT RACK|          [gumball, coat, r...|
+--------------------+------------------------------+
only showing top 5 rows



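# 2. Preprocessing estimators
- Needed when the transformation has to be initialized with data or information about the input column
- In other words, a transformer that is configured from the specific input data
- Example: StandardScaler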

In [13]:
from pyspark.ml.feature import StandardScaler

ss = StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(5)

+---+--------------+-----------------------------------+
| id|      features|StandardScaler_f10c68da2d39__output|
+---+--------------+-----------------------------------+
|  0|[1.0,0.1,-1.0]|               [1.19522860933439...|
|  1| [2.0,1.1,1.0]|               [2.39045721866878...|
|  0|[1.0,0.1,-1.0]|               [1.19522860933439...|
|  1| [2.0,1.1,1.0]|               [2.39045721866878...|
|  1|[3.0,10.1,3.0]|               [3.58568582800318...|
+---+--------------+-----------------------------------+
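
The fit-then-transform contract can be sketched in plain Python. The class names `UnitStdEstimator` and `UnitStdModel` are hypothetical and exist only for this illustration; they are not Spark classes. The sketch assumes the five `features` rows shown above and divides each column by its sample standard deviation without subtracting the mean, which is consistent with the first components in the output (about 1.1952, 2.3905, 3.5857).

```python
import statistics


class UnitStdModel:
    """Hypothetical fitted model: holds one standard deviation per column."""

    def __init__(self, stds):
        self.stds = stds

    def transform(self, rows):
        # Divide every value by the standard deviation learned for its column.
        return [[x / s for x, s in zip(row, self.stds)] for row in rows]


class UnitStdEstimator:
    """Hypothetical estimator: it has to see the data before it can transform."""

    def fit(self, rows):
        columns = list(zip(*rows))
        # statistics.stdev is the sample (n - 1) standard deviation.
        return UnitStdModel([statistics.stdev(col) for col in columns])


rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]
model = UnitStdEstimator().fit(rows)   # learn the column statistics
scaled = model.transform(rows)         # apply them
print([round(row[0], 4) for row in scaled])
```

A plain transformer such as Tokenizer needs no `fit` step because it learns nothing from the data.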



# 3. High-level transformers

## 3.1 RFormula
- Automatically handles categorical inputs given as strings by one-hot encoding them

In [15]:
from pyspark.ml.feature import RFormula

supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show()

+-----+----+------+------------------+--------------------+-----+
|color| lab|value1|            value2|            features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue| bad|    12|14.386294994851129|(10,[2,3,6,9],[12...|  0.0|
|green|good|    15| 38.97187133755819|(10,[1,2,3,5,8],[...|  1.0|
|green|good|    12|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
|green| bad|    16|14.386294994851129|(10,[1,2,3,5,8],[...|  0.0|
|  red|good|    35|14.386294994851129|(10,[0,2,3,4,7],[...|  1.0|
|  red| bad|     1| 38.97187133755819|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|     2|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|    16|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red|good|    45| 38.97187133755819|(10,[0,2,3,4,7],[...|  1.0|
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| ba

## 3.2 SQL transformer

In [16]:
from pyspark.ml.feature import SQLTransformer

basicTransformation = SQLTransformer()\
    .setStatement("""
        SELECT sum(Quantity), count(*), CustomerID
        FROM __THIS__
        GROUP BY CustomerID
    """)

basicTransformation.transform(sales).show()

+-------------+--------+----------+
|sum(Quantity)|count(1)|CustomerID|
+-------------+--------+----------+
|          119|      62|   14452.0|
|          440|     143|   16916.0|
|          630|      72|   17633.0|
|           34|       6|   14768.0|
|         1542|      30|   13094.0|
|          854|     117|   17884.0|
|           97|      12|   16596.0|
|          756|      67|   15145.0|
|           83|      13|   16858.0|
|           56|       4|   13160.0|
|         8873|      80|   16656.0|
|          241|      43|   16212.0|
|          258|      23|   13142.0|
|           67|      14|   13811.0|
|         1145|      62|   16600.0|
|          568|      87|   15898.0|
|        37720|    2491|   15311.0|
|         1467|      62|   18263.0|
|         2006|      94|   16353.0|
|         1486|     161|   17659.0|
+-------------+--------+----------+
only showing top 20 rows



## 3.3 Vector assembler
- Concatenates all features into one large vector that can be passed to an estimator

In [20]:
fakeIntDF.show()

+----+----+----+
|int1|int2|int3|
+----+----+----+
|   1|   2|   3|
|   4|   5|   6|
|   7|   8|   9|
+----+----+----+



In [23]:
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler().setInputCols(["int1", "int2", "int3"])
va.transform(fakeIntDF).show()

+----+----+----+------------------------------------+
|int1|int2|int3|VectorAssembler_7237f61f3237__output|
+----+----+----+------------------------------------+
|   1|   2|   3|                       [1.0,2.0,3.0]|
|   4|   5|   6|                       [4.0,5.0,6.0]|
|   7|   8|   9|                       [7.0,8.0,9.0]|
+----+----+----+------------------------------------+



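# 4. Working with continuous features
- A process called bucketing converts continuous features into categorical features
- Features can be scaled or normalized to meet various requirements
- These transformers only work on the Double type, so any other numeric type has to be cast to Double first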

In [25]:
contDF = spark.range(20).selectExpr("cast(id as double)")
contDF.show()

+----+
|  id|
+----+
| 0.0|
| 1.0|
| 2.0|
| 3.0|
| 4.0|
| 5.0|
| 6.0|
| 7.0|
| 8.0|
| 9.0|
|10.0|
|11.0|
|12.0|
|13.0|
|14.0|
|15.0|
|16.0|
|17.0|
|18.0|
|19.0|
+----+



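## 4.1 Bucketing
- Bucketizer splits a continuous feature into the buckets you specify
- For example, when analyzing body weight, a simple approach is to split it into three buckets: overweight, average, and underweight
- Rules for specifying the boundaries
    - The minimum value in the splits array must be less than the minimum value in the DataFrame
    - The maximum value in the splits array must be greater than the maximum value in the DataFrame
    - The splits array needs at least three values, so that it creates at least two buckets
- To cover every possible range, you can use **float("inf"), float("-inf")**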

In [28]:
from pyspark.ml.feature import Bucketizer

bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]
bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("id")
bucketer.transform(contDF).show()

+----+-------------------------------+
|  id|Bucketizer_e67baba75dea__output|
+----+-------------------------------+
| 0.0|                            0.0|
| 1.0|                            0.0|
| 2.0|                            0.0|
| 3.0|                            0.0|
| 4.0|                            0.0|
| 5.0|                            1.0|
| 6.0|                            1.0|
| 7.0|                            1.0|
| 8.0|                            1.0|
| 9.0|                            1.0|
|10.0|                            2.0|
|11.0|                            2.0|
|12.0|                            2.0|
|13.0|                            2.0|
|14.0|                            2.0|
|15.0|                            2.0|
|16.0|                            2.0|
|17.0|                            2.0|
|18.0|                            2.0|
|19.0|                            2.0|
+----+-------------------------------+
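
The bucket assignment can be reproduced with a short plain-Python sketch. The function `bucketize` is a hypothetical helper, not part of Spark; it assumes each bucket is a half-open interval `[lower, upper)`, with the last bucket also including its upper boundary, which is consistent with 5.0 landing in bucket 1.0 and 10.0 in bucket 2.0 above.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the bucket index of value for a sorted list of splits."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside [{splits[0]}, {splits[-1]}]")
    if value == splits[-1]:
        # The last bucket also includes its upper boundary.
        return float(len(splits) - 2)
    # Every other bucket is [lower, upper): a value on a boundary moves up.
    return float(bisect_right(splits, value) - 1)


bucket_borders = [-1.0, 5.0, 10.0, 250.0, 600.0]
buckets = [bucketize(float(v), bucket_borders) for v in range(20)]
print(buckets)

# Open-ended splits accept any finite value.
open_borders = [float("-inf"), 0.0, float("inf")]
print(bucketize(1e9, open_borders))
```

Values outside the splits raise an error in this sketch, which is why the boundary rules above matter.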



- Instead of specifying the boundaries yourself, you can also split on percentiles

In [30]:
from pyspark.ml.feature import QuantileDiscretizer

bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id").setOutputCol("result")
fittedBucketer = bucketer.fit(contDF)
fittedBucketer.transform(contDF).show()

+----+------+
|  id|result|
+----+------+
| 0.0|   0.0|
| 1.0|   0.0|
| 2.0|   0.0|
| 3.0|   1.0|
| 4.0|   1.0|
| 5.0|   1.0|
| 6.0|   1.0|
| 7.0|   2.0|
| 8.0|   2.0|
| 9.0|   2.0|
|10.0|   2.0|
|11.0|   3.0|
|12.0|   3.0|
|13.0|   3.0|
|14.0|   3.0|
|15.0|   4.0|
|16.0|   4.0|
|17.0|   4.0|
|18.0|   4.0|
|19.0|   4.0|
+----+------+



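## 4.3 StandardScaler
- Standardizes features so that they have a mean of 0 and a standard deviation of 1. With the default settings (`withStd=True`, `withMean=False`) it only scales to unit standard deviation and does not center the data, so the output of this example is not centered on 0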

In [31]:
from pyspark.ml.feature import StandardScaler

sScaler = StandardScaler().setInputCol("features")
sScaler.fit(scaleDF).transform(scaleDF).show()

+---+--------------+-----------------------------------+
| id|      features|StandardScaler_003a7eaa6e31__output|
+---+--------------+-----------------------------------+
|  0|[1.0,0.1,-1.0]|               [1.19522860933439...|
|  1| [2.0,1.1,1.0]|               [2.39045721866878...|
|  0|[1.0,0.1,-1.0]|               [1.19522860933439...|
|  1| [2.0,1.1,1.0]|               [2.39045721866878...|
|  1|[3.0,10.1,3.0]|               [3.58568582800318...|
+---+--------------+-----------------------------------+



## 4.4 MinMaxScaler
- Scales the values in a vector proportionally into a given range, from a minimum value to a maximum value

In [32]:
from pyspark.ml.feature import MinMaxScaler

minMax = MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()

+---+--------------+---------------------------------+
| id|      features|MinMaxScaler_7af6603d580f__output|
+---+--------------+---------------------------------+
|  0|[1.0,0.1,-1.0]|                    [5.0,5.0,5.0]|
|  1| [2.0,1.1,1.0]|                    [7.5,5.5,7.5]|
|  0|[1.0,0.1,-1.0]|                    [5.0,5.0,5.0]|
|  1| [2.0,1.1,1.0]|                    [7.5,5.5,7.5]|
|  1|[3.0,10.1,3.0]|                 [10.0,10.0,10.0]|
+---+--------------+---------------------------------+
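
The rescaling is per column: `(x - col_min) / (col_max - col_min) * (new_max - new_min) + new_min`. The sketch below is plain Python rather than Spark code, and `min_max_scale` is a hypothetical helper name. It uses the five `features` rows shown above with the same target range of 5 to 10.

```python
def min_max_scale(rows, new_min=0.0, new_max=1.0):
    """Rescale each column of rows linearly into [new_min, new_max]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        out = []
        for x, lo, hi in zip(row, lows, highs):
            if hi == lo:
                # Constant column: fall back to the middle of the target range.
                out.append(0.5 * (new_max + new_min))
            else:
                out.append((x - lo) / (hi - lo) * (new_max - new_min) + new_min)
        scaled.append(out)
    return scaled


rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]
result = min_max_scale(rows, new_min=5.0, new_max=10.0)
print([[round(x, 4) for x in row] for row in result])
```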



## 4.5 MaxAbsScaler
- Rescales the data by dividing each value by the maximum absolute value of its column
- All values end up between -1 and 1

In [33]:
from pyspark.ml.feature import MaxAbsScaler

maScaler = MaxAbsScaler().setInputCol("features")
fittedmaScaler = maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).show()

+---+--------------+---------------------------------+
| id|      features|MaxAbsScaler_035b244c7f59__output|
+---+--------------+---------------------------------+
|  0|[1.0,0.1,-1.0]|             [0.33333333333333...|
|  1| [2.0,1.1,1.0]|             [0.66666666666666...|
|  0|[1.0,0.1,-1.0]|             [0.33333333333333...|
|  1| [2.0,1.1,1.0]|             [0.66666666666666...|
|  1|[3.0,10.1,3.0]|                    [1.0,1.0,1.0]|
+---+--------------+---------------------------------+
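
As a plain-Python sketch of the same arithmetic (the helper `max_abs_scale` is hypothetical and assumes no column is entirely zero):

```python
def max_abs_scale(rows):
    """Divide each value by the largest absolute value in its column."""
    max_abs = [max(abs(x) for x in col) for col in zip(*rows)]
    return [[x / m for x, m in zip(row, max_abs)] for row in rows]


rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]
result = max_abs_scale(rows)
print([round(x, 4) for x in result[0]])
```

Because the divisor is an absolute value, signs are preserved and no shifting takes place.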



## 4.6 ElementwiseProduct
- Scales each value in a vector by an arbitrary value, taken from the matching position of a scaling vector

In [35]:
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors

scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)
scalingUp = ElementwiseProduct()\
    .setScalingVec(scaleUpVec)\
    .setInputCol("features")
scalingUp.transform(scaleDF).show()

+---+--------------+---------------------------------------+
| id|      features|ElementwiseProduct_fd2f8252b564__output|
+---+--------------+---------------------------------------+
|  0|[1.0,0.1,-1.0]|                       [10.0,1.5,-20.0]|
|  1| [2.0,1.1,1.0]|                       [20.0,16.5,20.0]|
|  0|[1.0,0.1,-1.0]|                       [10.0,1.5,-20.0]|
|  1| [2.0,1.1,1.0]|                       [20.0,16.5,20.0]|
|  1|[3.0,10.1,3.0]|                      [30.0,151.5,60.0]|
+---+--------------+---------------------------------------+



## 4.7 Normalizer
- Scales multidimensional vectors using one of several norms, selected with the parameter p
- For example, the Manhattan norm is p=1 and the Euclidean norm is p=2

In [36]:
from pyspark.ml.feature import Normalizer

manhattanDistance = Normalizer().setP(1).setInputCol("features")
manhattanDistance.transform(scaleDF).show()

+---+--------------+-------------------------------+
| id|      features|Normalizer_3faba6829ca5__output|
+---+--------------+-------------------------------+
|  0|[1.0,0.1,-1.0]|           [0.47619047619047...|
|  1| [2.0,1.1,1.0]|           [0.48780487804878...|
|  0|[1.0,0.1,-1.0]|           [0.47619047619047...|
|  1| [2.0,1.1,1.0]|           [0.48780487804878...|
|  1|[3.0,10.1,3.0]|           [0.18633540372670...|
+---+--------------+-------------------------------+
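
Unlike the column-wise scalers, the normalization works on each row vector on its own: every component is divided by the vector's p-norm. A minimal plain-Python sketch follows; the function `normalize` is a hypothetical helper, not Spark's implementation.

```python
def normalize(vector, p=2.0):
    """Divide a vector by its p-norm (p=1 Manhattan, p=2 Euclidean)."""
    if p == float("inf"):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        # An all-zero vector is returned unchanged.
        return list(vector)
    return [x / norm for x in vector]


# |1.0| + |0.1| + |-1.0| = 2.1, so the first component becomes 1.0 / 2.1.
print(normalize([1.0, 0.1, -1.0], p=1))
print(normalize([3.0, 4.0], p=2))
```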



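# 5. Working with categorical features

## 5.1 StringIndexer
- Maps strings to numeric indices and creates metadata attached to the DataFrame that records which input corresponds to which output, so the original input can later be recovered from each index value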

In [38]:
from pyspark.ml.feature import StringIndexer

lblIndxr = StringIndexer().setInputCol("lab").setOutputCol("labelInd")
idxRes = lblIndxr.fit(simpleDF).transform(simpleDF)
idxRes.show()

+-----+----+------+------------------+--------+
|color| lab|value1|            value2|labelInd|
+-----+----+------+------------------+--------+
|green|good|     1|14.386294994851129|     1.0|
| blue| bad|     8|14.386294994851129|     0.0|
| blue| bad|    12|14.386294994851129|     0.0|
|green|good|    15| 38.97187133755819|     1.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     0.0|
|  red|good|    35|14.386294994851129|     1.0|
|  red| bad|     1| 38.97187133755819|     0.0|
|  red| bad|     2|14.386294994851129|     0.0|
|  red| bad|    16|14.386294994851129|     0.0|
|  red|good|    45| 38.97187133755819|     1.0|
|green|good|     1|14.386294994851129|     1.0|
| blue| bad|     8|14.386294994851129|     0.0|
| blue| bad|    12|14.386294994851129|     0.0|
|green|good|    15| 38.97187133755819|     1.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     0.0|
|  red|good|    35|14.386294994851129|  

- StringIndexer can also be applied to non-string columns; the values are converted to strings before the index is built

In [41]:
valIndexer = StringIndexer().setInputCol("value1").setOutputCol("valueInd")
valIndexer.fit(simpleDF).transform(simpleDF).show()

+-----+----+------+------------------+--------+
|color| lab|value1|            value2|valueInd|
+-----+----+------+------------------+--------+
|green|good|     1|14.386294994851129|     0.0|
| blue| bad|     8|14.386294994851129|     7.0|
| blue| bad|    12|14.386294994851129|     1.0|
|green|good|    15| 38.97187133755819|     3.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     2.0|
|  red|good|    35|14.386294994851129|     5.0|
|  red| bad|     1| 38.97187133755819|     0.0|
|  red| bad|     2|14.386294994851129|     4.0|
|  red| bad|    16|14.386294994851129|     2.0|
|  red|good|    45| 38.97187133755819|     6.0|
|green|good|     1|14.386294994851129|     0.0|
| blue| bad|     8|14.386294994851129|     7.0|
| blue| bad|    12|14.386294994851129|     1.0|
|green|good|    15| 38.97187133755819|     3.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     2.0|
|  red|good|    35|14.386294994851129|  

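- By default, an error is raised when an input that was not seen during fitting arrives
- The `handleInvalid` parameter controls this: `error` (the default) raises an error, `skip` drops the rows that contain invalid input, and `keep` assigns them to an extra index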
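
The indexing and the handling of unseen labels can be sketched in plain Python. The helpers `fit_string_index` and `transform_labels` are hypothetical; the sketch assumes the most frequent label receives index 0.0 (consistent with `bad` mapping to 0.0 above) and breaks ties alphabetically. Only the `error` and `skip` behaviors are modeled.

```python
from collections import Counter


def fit_string_index(values):
    """Build a label -> index mapping, most frequent label first."""
    counts = Counter(str(v) for v in values)  # non-strings are cast to str
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_labels(values, mapping, handle_invalid="error"):
    """Replace labels by their index; unseen labels raise or are skipped."""
    out = []
    for v in values:
        label = str(v)
        if label in mapping:
            out.append(mapping[label])
        elif handle_invalid == "skip":
            continue  # drop the row that holds an unseen label
        else:
            raise ValueError(f"Unseen label: {label}")
    return out


labs = ["good", "bad", "bad", "good", "good", "bad", "good", "bad", "bad", "bad"]
mapping = fit_string_index(labs)
print(mapping)
print(transform_labels(["good", "unknown", "bad"], mapping, handle_invalid="skip"))

# Inverting the mapping recovers the original label from an index.
inverse = {index: label for label, index in mapping.items()}
print(inverse[1.0])
```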

In [42]:
valIndexer.setHandleInvalid("skip")
valIndexer.fit(simpleDF).setHandleInvalid("skip")

StringIndexerModel: uid=StringIndexer_fcfe830fcc9b, handleInvalid=skip

## 5.2 Converting indexed values back to text
- Maps the indexed results back to their original values

In [43]:
from pyspark.ml.feature import IndexToString

labelReverse = IndexToString().setInputCol("labelInd")
labelReverse.transform(idxRes).show()

+-----+----+------+------------------+--------+----------------------------------+
|color| lab|value1|            value2|labelInd|IndexToString_b45fc27c6240__output|
+-----+----+------+------------------+--------+----------------------------------+
|green|good|     1|14.386294994851129|     1.0|                              good|
| blue| bad|     8|14.386294994851129|     0.0|                               bad|
| blue| bad|    12|14.386294994851129|     0.0|                               bad|
|green|good|    15| 38.97187133755819|     1.0|                              good|
|green|good|    12|14.386294994851129|     1.0|                              good|
|green| bad|    16|14.386294994851129|     0.0|                               bad|
|  red|good|    35|14.386294994851129|     1.0|                              good|
|  red| bad|     1| 38.97187133755819|     0.0|                               bad|
|  red| bad|     2|14.386294994851129|     0.0|                               bad|
|  r

## 5.3 Vector indexing

In [44]:
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

idxIn = spark.createDataFrame([
    (Vectors.dense(1, 2, 3), 1),
    (Vectors.dense(2, 5, 6), 2),
    (Vectors.dense(1, 8, 9), 3)
]).toDF("features", "label")
idxIn.show()

+-------------+-----+
|     features|label|
+-------------+-----+
|[1.0,2.0,3.0]|    1|
|[2.0,5.0,6.0]|    2|
|[1.0,8.0,9.0]|    3|
+-------------+-----+



In [45]:
indxr = VectorIndexer()\
    .setInputCol("features")\
    .setOutputCol("idxed")\
    .setMaxCategories(2)
indxr.fit(idxIn).transform(idxIn).show()

+-------------+-----+-------------+
|     features|label|        idxed|
+-------------+-----+-------------+
|[1.0,2.0,3.0]|    1|[0.0,2.0,3.0]|
|[2.0,5.0,6.0]|    2|[1.0,5.0,6.0]|
|[1.0,8.0,9.0]|    3|[0.0,8.0,9.0]|
+-------------+-----+-------------+
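
In the output above only the first vector position changes: it has two distinct values (1.0 and 2.0), which is within `maxCategories=2`, so it is treated as categorical and re-indexed, while the other two positions have three distinct values each and stay as they are. The plain-Python sketch below illustrates that rule. The helper names are hypothetical, and assigning indices in ascending order of the original values is an assumption that happens to match this output.

```python
def fit_vector_index(rows, max_categories):
    """Find vector positions with few distinct values and index them."""
    category_maps = {}
    for position, column in enumerate(zip(*rows)):
        distinct = sorted(set(column))
        if len(distinct) <= max_categories:
            category_maps[position] = {v: float(i) for i, v in enumerate(distinct)}
    return category_maps


def transform_vector_index(rows, category_maps):
    """Replace categorical positions by their index; leave the rest untouched."""
    return [
        [category_maps[j][x] if j in category_maps else x for j, x in enumerate(row)]
        for row in rows
    ]


rows = [[1.0, 2.0, 3.0], [2.0, 5.0, 6.0], [1.0, 8.0, 9.0]]
category_maps = fit_vector_index(rows, max_categories=2)
indexed = transform_vector_index(rows, category_maps)
print(indexed)
```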



## 5.4 One-hot encoding

In [48]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

lblIndxr = StringIndexer().setInputCol("color").setOutputCol("colorInd")
colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color"))
ohe = OneHotEncoder().setInputCol("colorInd")
ohe.fit(colorLab).transform(colorLab).show()

+-----+--------+----------------------------------+
|color|colorInd|OneHotEncoder_650f9bb5b001__output|
+-----+--------+----------------------------------+
|green|     1.0|                     (2,[1],[1.0])|
| blue|     2.0|                         (2,[],[])|
| blue|     2.0|                         (2,[],[])|
|green|     1.0|                     (2,[1],[1.0])|
|green|     1.0|                     (2,[1],[1.0])|
|green|     1.0|                     (2,[1],[1.0])|
|  red|     0.0|                     (2,[0],[1.0])|
|  red|     0.0|                     (2,[0],[1.0])|
|  red|     0.0|                     (2,[0],[1.0])|
|  red|     0.0|                     (2,[0],[1.0])|
|  red|     0.0|                     (2,[0],[1.0])|
|green|     1.0|                     (2,[1],[1.0])|
| blue|     2.0|                         (2,[],[])|
| blue|     2.0|                         (2,[],[])|
|green|     1.0|                     (2,[1],[1.0])|
|green|     1.0|                     (2,[1],[1.0])|
|green|     

In [49]:
colorLab.show()

+-----+--------+
|color|colorInd|
+-----+--------+
|green|     1.0|
| blue|     2.0|
| blue|     2.0|
|green|     1.0|
|green|     1.0|
|green|     1.0|
|  red|     0.0|
|  red|     0.0|
|  red|     0.0|
|  red|     0.0|
|  red|     0.0|
|green|     1.0|
| blue|     2.0|
| blue|     2.0|
|green|     1.0|
|green|     1.0|
|green|     1.0|
|  red|     0.0|
|  red|     0.0|
|  red|     0.0|
+-----+--------+
only showing top 20 rows
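
The encoder output is a sparse vector written as `(size, [indices], [values])`. With three colors the vector has size 2 because the last category is dropped by default (`dropLast=True`), so `blue` (index 2.0) becomes the all-zero vector `(2,[],[])`. A plain-Python sketch of that encoding, where `one_hot` is a hypothetical helper:

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a sparse (size, indices, values) triple."""
    size = num_categories - 1 if drop_last else num_categories
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    # The dropped last category is represented by an all-zero vector.
    return (size, [], [])


print(one_hot(0.0, 3))                   # red
print(one_hot(1.0, 3))                   # green
print(one_hot(2.0, 3))                   # blue
print(one_hot(2.0, 3, drop_last=False))  # blue with all categories kept
```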



# 6. Text data transformers

## 6.1 Tokenizing text

In [4]:
from pyspark.ml.feature import Tokenizer

tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.select("Description"))
tokenized.show(20, False)

+-----------------------------------+------------------------------------------+
|Description                        |DescOut                                   |
+-----------------------------------+------------------------------------------+
|RABBIT NIGHT LIGHT                 |[rabbit, night, light]                    |
|DOUGHNUT LIP GLOSS                 |[doughnut, lip, gloss]                    |
|12 MESSAGE CARDS WITH ENVELOPES    |[12, message, cards, with, envelopes]     |
|BLUE HARMONICA IN BOX              |[blue, harmonica, in, box]                |
|GUMBALL COAT RACK                  |[gumball, coat, rack]                     |
|SKULLS  WATER TRANSFER TATTOOS     |[skulls, , water, transfer, tattoos]      |
|FELTCRAFT GIRL AMELIE KIT          |[feltcraft, girl, amelie, kit]            |
|CAMOUFLAGE LED TORCH               |[camouflage, led, torch]                  |
|WHITE SKULL HOT WATER BOTTLE       |[white, skull, hot, water, bottle]        |
|ENGLISH ROSE HOT WATER BOTT

- RegexTokenizer lets you build a tokenizer that splits on a regular expression rather than only on whitespace
- The pattern must follow Java regular expression (RegEx) syntax

In [10]:
from pyspark.ml.feature import RegexTokenizer

rt = RegexTokenizer()\
    .setInputCol("Description")\
    .setOutputCol("DescOut")\
    .setPattern(" ")\
    .setToLowercase(True)
rt.transform(sales.select("Description")).show(20, False)

+-----------------------------------+------------------------------------------+
|Description                        |DescOut                                   |
+-----------------------------------+------------------------------------------+
|RABBIT NIGHT LIGHT                 |[rabbit, night, light]                    |
|DOUGHNUT LIP GLOSS                 |[doughnut, lip, gloss]                    |
|12 MESSAGE CARDS WITH ENVELOPES    |[12, message, cards, with, envelopes]     |
|BLUE HARMONICA IN BOX              |[blue, harmonica, in, box]                |
|GUMBALL COAT RACK                  |[gumball, coat, rack]                     |
|SKULLS  WATER TRANSFER TATTOOS     |[skulls, water, transfer, tattoos]        |
|FELTCRAFT GIRL AMELIE KIT          |[feltcraft, girl, amelie, kit]            |
|CAMOUFLAGE LED TORCH               |[camouflage, led, torch]                  |
|WHITE SKULL HOT WATER BOTTLE       |[white, skull, hot, water, bottle]        |
|ENGLISH ROSE HOT WATER BOTT

- Instead of splitting on the pattern, the tokenizer can output the values that match the given pattern
- Set the **gaps** parameter to **false** to do this
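
The difference between the two modes can be sketched with Python's `re` module. This is only an approximation: Spark uses Java regular expressions, whose syntax differs from Python's in places, and `regex_tokenize` is a hypothetical helper. The sketch also assumes tokens shorter than one character are dropped, which is why a double space produces no empty token in split mode.

```python
import re


def regex_tokenize(text, pattern, gaps=True, to_lowercase=True, min_token_length=1):
    """Split on the pattern (gaps=True) or return the matches (gaps=False)."""
    source = text.lower() if to_lowercase else text
    if gaps:
        tokens = re.split(pattern, source)
    else:
        tokens = [m.group(0) for m in re.finditer(pattern, source)]
    # Drop tokens that are too short, such as empty strings between two spaces.
    return [t for t in tokens if len(t) >= min_token_length]


print(regex_tokenize("SKULLS  WATER TRANSFER TATTOOS", " "))
print(regex_tokenize("RABBIT NIGHT LIGHT", " ", gaps=False))
```

With `gaps=False` and the pattern `" "`, the tokens are the spaces themselves, so a more useful pattern in that mode is one that matches the words.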

In [13]:
from pyspark.ml.feature import RegexTokenizer

rt = RegexTokenizer()\
    .setInputCol("Description")\
    .setOutputCol("DescOut")\
    .setPattern(" ")\
    .setGaps(False)\
    .setToLowercase(True)
rt.transform(sales.select("Description")).show(20, False)

+-----------------------------------+------------------+
|Description                        |DescOut           |
+-----------------------------------+------------------+
|RABBIT NIGHT LIGHT                 |[ ,  ]            |
|DOUGHNUT LIP GLOSS                 |[ ,  ,  ]         |
|12 MESSAGE CARDS WITH ENVELOPES    |[ ,  ,  ,  ]      |
|BLUE HARMONICA IN BOX              |[ ,  ,  ,  ]      |
|GUMBALL COAT RACK                  |[ ,  ]            |
|SKULLS  WATER TRANSFER TATTOOS     |[ ,  ,  ,  ,  ]   |
|FELTCRAFT GIRL AMELIE KIT          |[ ,  ,  ]         |
|CAMOUFLAGE LED TORCH               |[ ,  ]            |
|WHITE SKULL HOT WATER BOTTLE       |[ ,  ,  ,  ,  ]   |
|ENGLISH ROSE HOT WATER BOTTLE      |[ ,  ,  ,  ]      |
|HOT WATER BOTTLE KEEP CALM         |[ ,  ,  ,  ]      |
|SCOTTIE DOG HOT WATER BOTTLE       |[ ,  ,  ,  ]      |
|ROSE CARAVAN DOORSTOP              |[ ,  ]            |
|GINGHAM HEART  DOORSTOP RED        |[ ,  ,  ,  ]      |
|STORAGE TIN VINTAGE LEAF      

## 6.2 Removing common words

In [18]:
from pyspark.ml.feature import StopWordsRemover

englishStopWords = StopWordsRemover.loadDefaultStopWords("english")
stops = StopWordsRemover()\
    .setStopWords(englishStopWords)\
    .setInputCol("DescOut")
stops.transform(tokenized).show(5, False)

+-------------------------------+-------------------------------------+-------------------------------------+
|Description                    |DescOut                              |StopWordsRemover_52dbe61992eb__output|
+-------------------------------+-------------------------------------+-------------------------------------+
|RABBIT NIGHT LIGHT             |[rabbit, night, light]               |[rabbit, night, light]               |
|DOUGHNUT LIP GLOSS             |[doughnut, lip, gloss]               |[doughnut, lip, gloss]               |
|12 MESSAGE CARDS WITH ENVELOPES|[12, message, cards, with, envelopes]|[12, message, cards, envelopes]      |
|BLUE HARMONICA IN BOX          |[blue, harmonica, in, box]           |[blue, harmonica, box]               |
|GUMBALL COAT RACK              |[gumball, coat, rack]                |[gumball, coat, rack]                |
+-------------------------------+-------------------------------------+-------------------------------------+
only showi

## 6.3 Creating word combinations
- A word combination is technically a sequence of **n** words, that is, an **n-gram**
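
A minimal plain-Python sketch of the idea: slide a window of length n over the tokens and join each window with a space. The helper `ngrams` is hypothetical and only illustrates the technique.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["rabbit", "night", "light"]
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 4))  # fewer than n tokens yields an empty list
```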

In [19]:
from pyspark.ml.feature import NGram

unigram = NGram().setInputCol("DescOut").setN(1)
bigram = NGram().setInputCol("DescOut").setN(2)
unigram.transform(tokenized.select("DescOut")).show(10, False)
bigram.transform(tokenized.select("DescOut")).show(10, False)

+-------------------------------------+-------------------------------------+
|DescOut                              |NGram_88f7dd6c0325__output           |
+-------------------------------------+-------------------------------------+
|[rabbit, night, light]               |[rabbit, night, light]               |
|[doughnut, lip, gloss]               |[doughnut, lip, gloss]               |
|[12, message, cards, with, envelopes]|[12, message, cards, with, envelopes]|
|[blue, harmonica, in, box]           |[blue, harmonica, in, box]           |
|[gumball, coat, rack]                |[gumball, coat, rack]                |
|[skulls, , water, transfer, tattoos] |[skulls, , water, transfer, tattoos] |
|[feltcraft, girl, amelie, kit]       |[feltcraft, girl, amelie, kit]       |
|[camouflage, led, torch]             |[camouflage, led, torch]             |
|[white, skull, hot, water, bottle]   |[white, skull, hot, water, bottle]   |
|[english, rose, hot, water, bottle]  |[english, rose, hot, wate

## 6.4 Converting words into numbers
### CountVectorizer
- Output: (total vocabulary size, indices of the words in the vocabulary, count of each of those words)
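
The two steps, building a vocabulary and then counting, can be sketched in plain Python. The helpers and the toy documents are hypothetical. The sketch assumes the vocabulary keeps the most frequent terms that appear in at least `min_df` documents and breaks ties alphabetically, which may differ from Spark's ordering.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1):
    """Keep the most frequent terms that occur in at least min_df documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)
    eligible = [term for term in term_freq if doc_freq[term] >= min_df]
    eligible.sort(key=lambda term: (-term_freq[term], term))
    return eligible[:vocab_size]


def count_vector(doc, vocabulary):
    """Encode one document as a sparse (size, indices, counts) triple."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[term] for term in doc if term in index)
    indices = sorted(counts)
    return (len(vocabulary), indices, [float(counts[i]) for i in indices])


docs = [["red", "bag"], ["red", "box"], ["blue", "bag", "bag"]]
vocabulary = fit_vocabulary(docs, vocab_size=500, min_df=2)
print(vocabulary)
print(count_vector(["blue", "bag", "bag"], vocabulary))
```

Terms outside the vocabulary, such as `blue` here, simply do not appear in the encoded vector.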

In [20]:
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer()\
    .setInputCol("DescOut")\
    .setOutputCol("countVec")\
    .setVocabSize(500)\
    .setMinTF(1)\
    .setMinDF(2)
fittedCV = cv.fit(tokenized)
fittedCV.transform(tokenized).show(10, False)

+-------------------------------+-------------------------------------+---------------------------------------------+
|Description                    |DescOut                              |countVec                                     |
+-------------------------------+-------------------------------------+---------------------------------------------+
|RABBIT NIGHT LIGHT             |[rabbit, night, light]               |(500,[149,185,212],[1.0,1.0,1.0])            |
|DOUGHNUT LIP GLOSS             |[doughnut, lip, gloss]               |(500,[462,463,492],[1.0,1.0,1.0])            |
|12 MESSAGE CARDS WITH ENVELOPES|[12, message, cards, with, envelopes]|(500,[35,41,166],[1.0,1.0,1.0])              |
|BLUE HARMONICA IN BOX          |[blue, harmonica, in, box]           |(500,[10,16,36,352],[1.0,1.0,1.0,1.0])       |
|GUMBALL COAT RACK              |[gumball, coat, rack]                |(500,[228,280,408],[1.0,1.0,1.0])            |
|SKULLS  WATER TRANSFER TATTOOS |[skulls, , water, trans

### TF-IDF

In [21]:
tfIdfIn = tokenized\
    .where("array_contains(DescOut, 'red')")\
    .select("DescOut")\
    .limit(10)
tfIdfIn.show(10, False)

+---------------------------------------+
|DescOut                                |
+---------------------------------------+
|[red, star, card, holder]              |
|[jumbo, bag, red, retrospot]           |
|[red, diner, wall, clock]              |
|[lunch, bag, red, retrospot]           |
|[red, flock, love, heart, photo, frame]|
|[red, woolly, hottie, white, heart.]   |
|[jumbo, bag, red, retrospot]           |
|[edwardian, parasol, red]              |
|[jumbo, bag, red, retrospot]           |
|[red, hanging, heart, t-light, holder] |
+---------------------------------------+



In [24]:
from pyspark.ml.feature import HashingTF, IDF

tf = HashingTF()\
    .setInputCol("DescOut")\
    .setOutputCol("TFOut")\
    .setNumFeatures(10000)
idf = IDF()\
    .setInputCol("TFOut")\
    .setOutputCol("IDFCol")\
    .setMinDocFreq(2)

idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(10, False)

+----------------------------------------+------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|DescOut                                 |TFOut                                                                   |IDFCol                                                                                             |
+----------------------------------------+------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+
|[wrap, red, apples]                     |(10000,[52,3405,9749],[1.0,1.0,1.0])                                    |(10000,[52,3405,9749],[0.0,1.2992829841302609,0.0])                                                |
|[pack, of, 20, napkins, red, apples]    |(10000,[52,429,3405,4495,6547,9012],[1.0,1.0,1.0,1.0,1.0,1.0])          |(10000,[52,429,3405,4
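
The weights in `IDFCol` are consistent with the formula `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. With the 10 documents selected by `limit(10)`, a term that occurs in every document (such as `red`) gets `log(11 / 11) = 0`, and a term that occurs in 2 documents gets `log(11 / 3)`, about 1.2993. The sketch below is plain Python, and `idf_weight` is a hypothetical helper that assumes terms below `min_doc_freq` receive a weight of 0.

```python
import math


def idf_weight(num_docs, doc_freq, min_doc_freq=0):
    """Inverse document frequency with +1 smoothing on both counts."""
    if doc_freq < min_doc_freq:
        # Terms seen in too few documents are filtered out with weight 0.
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))


print(idf_weight(10, 10, min_doc_freq=2))  # term present in every document
print(idf_weight(10, 2, min_doc_freq=2))   # term present in 2 documents
print(idf_weight(10, 1, min_doc_freq=2))   # below min_doc_freq
```

The final TF-IDF value of a term is its term frequency from `TFOut` multiplied by this weight.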

## 6.5 Word2Vec

In [27]:
from pyspark.ml.feature import Word2Vec
from pyspark.ml.linalg import Vector
from pyspark.sql import Row

documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I with Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

word2Vec = Word2Vec()\
    .setInputCol("text")\
    .setOutputCol("result")\
    .setVectorSize(3)\
    .setMinCount(0)

model = word2Vec.fit(documentDF)
result = model.transform(documentDF)

for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))

Text: [Hi, I, heard, about, Spark] => 
Vector: [0.06367666050791741,-0.047254581749439244,0.0387911431491375]

Text: [I, with, Java, could, use, case, classes] => 
Vector: [0.010081456042826176,-0.08104146995382117,0.07273546925612857]

Text: [Logistic, regression, models, are, neat] => 
Vector: [0.05691406726837159,-0.005072242021560669,-0.059899924695491796]



# 7. Feature manipulation

## 7.1 Principal component analysis
- Principal component analysis (PCA) is a mathematical technique for finding the most important aspects of the data

In [28]:
from pyspark.ml.feature import PCA

pca = PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(20, False)

+---+--------------+------------------------------------------+
|id |features      |PCA_b3c7be1d1cac__output                  |
+---+--------------+------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|1  |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060155975]|
+---+--------------+------------------------------------------+



## 7.2 ChiSqSelector

In [31]:
from pyspark.ml.feature import ChiSqSelector, Tokenizer

tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")

tokenized = tkn\
    .transform(sales.select("Description", "CustomerId"))\
    .where("CustomerId IS NOT NULL")
prechi = fittedCV.transform(tokenized)\
    .where("CustomerId IS NOT NULL")
chisq = ChiSqSelector()\
    .setFeaturesCol("countVec")\
    .setLabelCol("CustomerId")\
    .setNumTopFeatures(2)
chisq.fit(prechi).transform(prechi)\
    .drop("customerId", "Description", "DescOut").show()

+--------------------+----------------------------------+
|            countVec|ChiSqSelector_c638e4b9988e__output|
+--------------------+----------------------------------+
|(500,[149,185,212...|                         (2,[],[])|
|(500,[462,463,492...|                         (2,[],[])|
|(500,[35,41,166],...|                         (2,[],[])|
|(500,[10,16,36,35...|                         (2,[],[])|
|(500,[228,280,408...|                         (2,[],[])|
|(500,[11,40,133],...|                         (2,[],[])|
|(500,[60,64,69],[...|                         (2,[],[])|
|   (500,[264],[1.0])|                         (2,[],[])|
|(500,[15,34,39,40...|                         (2,[],[])|
|(500,[34,39,40,46...|                         (2,[],[])|
|(500,[34,39,40,14...|                         (2,[],[])|
|(500,[34,39,40,14...|                         (2,[],[])|
|(500,[46,297],[1....|                         (2,[],[])|
|(500,[3,4,11,143,...|                         (2,[],[])|
|(500,[6,45,10